This section lays the groundwork for the most common data analysis tasks we face when tackling empirical problems.
Reading .csv Files
The most common file format you’re likely to find in the real world is .csv (comma-separated values).
A quick clarification: not all comma-separated files have the .csv extension. Often you’ll find a file with the extension .txt that actually contains comma-separated text. You just have to inspect the file to figure out how it’s stored.
In fact, that’s the case for our first example. We’re going to look at teacher attendance data for the School District of Philadelphia.
The school district releases, through their Open Data Initiative, a whole bunch of data files on topics like school catchment zones, employee salaries, budget details, etc. All of the data can be found here: http://webgui.phila.k12.pa.us/offices/o/open-data-initiative.
We could use R to download the files (we may touch on this in the afternoon), but to ease us into things, I’ve gone ahead and downloaded some files and put them on my GitHub page. Let’s look at the data from 2013-14. Here’s a link to the data, but it should also be loaded on your machine.
We can preview the file using the terminal on a Mac. Learning to do some basic stuff in the terminal is a great investment of your time, but that’s not the focus of today’s lesson, so we’ll stick with doing things through R. We can send commands to the terminal from R using the system function. The terminal command head (which is also a function in R, but one that only applies to objects already loaded) will spit out the first few lines of a file so we can see what it looks like without having to load the whole thing (which, for larger files, can be time-consuming).
#Just to flex some of R's muscles, we'll find
# the data file from within R. As often as not,
# we'll just navigate to the file ourselves and
# just copy-paste the file path.
# We're looking for the folder with the data
# so we use include.dirs to include directories
list.files(include.dirs = TRUE)
## [1] "data"
## [2] "iesrtutorial_cache"
## [3] "iesrtutorial_files"
## [4] "iesrtutorial.html"
## [5] "iesrtutorial.Rmd"
## [6] "README.md"
## [7] "workshop_installation_instructions.pdf"
#I've kept things in the data folder here. We
# can tell it's a folder as it has no extension.
# We can find the files in that folder using a
# relative path (./ means starting from the
# current folder, enter the folder after /).
# We use the full.names of the file for the next step.
list.files("./data", full.names = TRUE)
## [1] "./data/00keys.csv"
## [2] "./data/00staff.dat"
## [3] "./data/2015-2016 Enr Dem (Suppressed).xls"
## [4] "./data/ag131a_supp.sas7bdat"
## [5] "./data/carsdata.dta"
## [6] "./data/employee_information.csv"
## [7] "./data/README_SDP_Employee.txt"
## [8] "./data/School Profiles Teacher Attendance 2013-2014.TXT"
#Now we see the data files. The one we're after
# currently is "School Profiles Teacher Attendance 2013-2014.TXT"
# To get it previewed, we send the command head to the system
# and follow it with the quoted file name (since the
# file name has spaces, which is a pain)
command = paste0('head ', '"./data/School Profiles Teacher Attendance 2013-2014.TXT"')
system(command)
## ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID
## "8350","2013-2014",92.7,93.4,"835"
## "7320","2013-2014",95,93.4,"732"
## "5440","2013-2014",96,93.4,"544"
## "8320","2013-2014",95.1,93.4,"832"
## "6440","2013-2014",96.6,93.4,"644"
## "4370","2013-2014",93.8,93.4,"437"
## "4030","2013-2014",94.3,93.4,"403"
## "2240","2013-2014",94,93.4,"224"
## "6450","2013-2014",95.3,93.4,"645"
We can see that even though the data has extension .txt, the file itself is very regular and clearly comma-separated.
fread/data.table
To read in this data, I’m going to eschew the standard intro-to-R suggestion of read.csv, because we’re now presented with the perfect opportunity to introduce the crown jewel of R – data.table (full disclosure: I am an active contributor to it, but that has followed from my love for it).
data.table is a package created by Matthew Dowle and actively maintained by him, Arun Srinivasan, and Jan Gorecki. Basically, it’s designed to make working with data in R very straightforward/intuitive and incredibly fast (since it’s designed to handle data with tens of millions of observations). We’ll be spending a lot of time working with data.table today, and picking most of it up as we go along. You can read more on the data.table homepage when you get a chance.
The first and most famous tool of data.table is fread. The “f” means FAST. fread is a highly optimized tool for reading data into R at blazing-fast speeds. You’ll barely notice the difference for typical files (fewer than 100,000 observations), but when you do notice the difference, it’s a sight to behold.
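If you want to get a feel for the speed difference yourself, a rough-and-ready sketch is to time both readers on the same file with base R’s system.time (no special benchmarking package needed; on a 212-row file like ours the difference will be negligible):

```r
# Compare base R's read.csv with data.table's fread on the same file.
# Try this on a file with millions of rows to see fread shine.
f <- "./data/School Profiles Teacher Attendance 2013-2014.TXT"
system.time(read.csv(f))           # base R reader
system.time(data.table::fread(f))  # typically much faster on large files
```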
On your own machine, you’ll need to install data.table (luckily for us, it’s already been pre-installed on your machines for today) using:
install.packages("data.table")
#the more confident and ambitious can install
# the development version of the package,
# which tends to have more features, but may
# occasionally be more error-prone.
# Caveat coder.
install.packages("data.table", type = "source",
repos = "https://Rdatatable.github.io/data.table")
Once we’re sure it’s on our machine, we can read the data by running:
#load the data.table package
library(data.table)
attendance <-
fread("./data/School Profiles Teacher Attendance 2013-2014.TXT")
Just like that!
Entering the attendance variable by itself will invoke the print.data.table method and display a preview of the data that was loaded:
attendance
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
#we can also explicitly invoke print,
# which has a few more bells and whistles to
# facilitate understanding what our data.table
# looks like.
print(attendance, class = TRUE)
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## <char> <char> <num> <num> <char>
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
Using the class argument to print, the preview now includes an abbreviated type under each of the column headers letting you know what class R thinks each column is (note class is a recent update to data.table; for the moment, it is not available in the version on CRAN, but should be available there soon).
What about attendance itself? It looks like something we’ve not seen yet. Let’s explore:
class(attendance)
## [1] "data.table" "data.frame"
str(attendance)
## Classes 'data.table' and 'data.frame': 212 obs. of 5 variables:
## $ ULCS_NO : chr "8350" "7320" "5440" "8320" ...
## $ SCHOOL_YEAR : chr "2013-2014" "2013-2014" "2013-2014" "2013-2014" ...
## $ SCH_TEACHER_ATTEND: num 92.7 95 96 95.1 96.6 93.8 94.3 94 95.3 92.2 ...
## $ SDP_TEACHER_ATTEND: num 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 ...
## $ SCHOOL_ID : chr "835" "732" "544" "832" ...
## - attr(*, ".internal.selfref")=<externalptr>
is.list(attendance)
## [1] TRUE
OK, lots of important stuff here. We’ve discovered a new type of object! attendance is a data.table. You can think of this as being a single sheet of an Excel workbook. It is specific to the data.table package, and is an improvement on the data.frame type found in base R (we’ll see this type a bit later).
As far as how R thinks of it, a data.table is a list on steroids. Each element of the list is a column, and each element has the same number of items. We can see from str that, like a list, we can use $ to extract items (here: columns):
attendance$SCH_TEACHER_ATTEND
## [1] 92.7 95.0 96.0 95.1 96.6 93.8 94.3 94.0 95.3 92.2 90.8 94.3 93.6 95.0
## [15] 92.8 85.3 86.4 95.9 93.0 94.8 90.1 95.0 93.4 94.4 91.8 94.6 95.4 92.3
## [29] 95.8 94.0 95.1 94.7 88.4 93.7 94.6 95.5 94.8 95.3 94.0 95.3 89.9 93.2
## [43] 91.7 89.0 93.1 92.7 94.2 97.2 93.6 92.4 92.6 92.7 92.0 94.4 94.3 93.6
## [57] 92.4 95.1 93.4 95.6 87.8 94.7 93.5 95.7 90.9 94.4 95.7 90.2 95.7 95.2
## [71] 93.3 93.5 89.0 95.0 89.7 93.9 93.2 93.5 92.1 93.0 94.8 95.1 94.8 90.8
## [85] 95.7 93.3 92.8 89.6 95.4 91.5 94.1 93.3 93.7 96.8 95.3 94.1 93.9 88.0
## [99] 94.6 95.1 86.9 95.5 95.0 97.6 95.8 93.8 94.4 96.0 95.8 89.7 94.3 92.7
## [113] 93.1 95.1 94.5 95.4 94.4 94.4 90.8 94.5 93.3 90.6 93.5 94.4 91.9 90.7
## [127] 91.6 92.1 94.7 95.4 86.5 94.8 94.5 90.1 96.1 93.1 95.8 92.0 94.1 91.7
## [141] 94.0 92.4 95.6 91.2 92.1 92.8 93.8 93.7 88.8 96.1 94.4 90.6 95.3 95.3
## [155] 90.9 94.0 94.1 95.6 93.9 91.9 91.5 93.2 94.2 93.0 91.2 91.8 95.9 92.1
## [169] 93.8 92.6 90.0 94.1 96.5 91.1 89.8 93.6 94.3 97.6 93.0 98.1 91.4 93.4
## [183] 94.5 94.1 94.1 88.7 94.0 93.4 94.2 94.6 96.2 91.4 92.0 96.5 92.0 95.1
## [197] 94.7 96.6 96.8 91.7 96.1 95.9 93.0 91.5 94.4 93.2 88.5 92.0 91.4 95.3
## [211] 94.7 94.7
Exploring Attendance Data
What are the columns in this data? We usually hope we have a data dictionary, but I didn’t find one on the SDP website. It appears that ULCS_NO is the same as SCHOOL_ID (padded with a 0). It’s not clear if the school district is anonymizing the schools or if we simply have to match the SCHOOL_ID to the school’s name in another file.
The meat of the data is the two columns SCH_TEACHER_ATTEND and SDP_TEACHER_ATTEND. The former is the attendance rate of teachers at a given school; the latter is the average for the whole district (which is why it’s the same in every row).
We can now start to explore the data. The only real column of interest is SCH_TEACHER_ATTEND. With this, we can explore how teacher attendance varies across the city.
Interacting with a data.table for exploration involves accessing it with []. Within [], there are three main arguments:
- i: which rows do we want? (subsetting)
- j: which columns do we want? (selection and manipulation)
- by: grouping our operations in j by categories within a column
The i and j notation come from thinking of a data.table as a matrix. To express the ith row and jth column of a matrix A, math people often write A[i,j].
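As a preview, here’s a sketch putting all three arguments together on the attendance data (with only one school year in this file, by produces a single group, but the pattern generalizes):

```r
library(data.table)
attendance[SCH_TEACHER_ATTEND > SDP_TEACHER_ATTEND,  # i: subset rows
           mean(SCH_TEACHER_ATTEND),                 # j: compute on a column
           by = SCHOOL_YEAR]                         # by: once per group
```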
Let’s see the first two in action as we explore our first data set:
#Summarize the variable to get a quick understanding
#we're looking at ALL rows, so we leave the first argument BLANK
#we're trying to summarize SCH_TEACHER_ATTEND, so we
# use the summary function in the second argument
attendance[ , summary(SCH_TEACHER_ATTEND)]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85.30 92.10 93.90 93.43 95.00 98.10
So most schools are within about 2 percentage points of the district average. What about schools that are particularly high or particularly low?
#We are focusing on certain rows, so we
# say the condition defining those
# rows in the first argument
attendance[SCH_TEACHER_ATTEND < 90]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 4350 2013-2014 85.3 93.4 435
## 2: 4570 2013-2014 86.4 93.4 457
## 3: 1340 2013-2014 88.4 93.4 134
## 4: 7350 2013-2014 89.9 93.4 735
## 5: 7220 2013-2014 89.0 93.4 722
## 6: 2480 2013-2014 87.8 93.4 248
## 7: 6360 2013-2014 89.0 93.4 636
## 8: 8240 2013-2014 89.7 93.4 824
## 9: 5430 2013-2014 89.6 93.4 543
## 10: 6260 2013-2014 88.0 93.4 626
## 11: 6340 2013-2014 86.9 93.4 634
## 12: 7290 2013-2014 89.7 93.4 729
## 13: 5370 2013-2014 86.5 93.4 537
## 14: 1330 2013-2014 88.8 93.4 133
## 15: 7510 2013-2014 89.8 93.4 751
## 16: 4240 2013-2014 88.7 93.4 424
## 17: 1130 2013-2014 88.5 93.4 113
#And schools with particularly high attendance
attendance[SCH_TEACHER_ATTEND > 97]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 5340 2013-2014 97.2 93.4 534
## 2: 6040 2013-2014 97.6 93.4 604
## 3: 8560 2013-2014 97.6 93.4 856
## 4: 1190 2013-2014 98.1 93.4 119
Without knowing more about the schools, all we can say is that there are a certain number of schools with very high or very low attendance.
Teacher Salaries
Another file posted to the SDP website contains information on teachers’ salaries. I’ve again downloaded this data already for you, but if you’re curious, here’s a link to the data from the website.
This one actually comes with a data dictionary of sorts in the form of a README file. We can use R to display the README like so:
writeLines(readLines("./data/README_SDP_Employee.txt"))
README_SDP_Employee.txt
REVISIONS
20160401 (April 1st, 2016) - Second release in 2016.
This README file contains some background information and guidance on School District of Philadelphia (SDP) data. This set of data was extracted on March 31st 2016, and includes every active employee of the School District of Philadelphia as of the extract date. Do not hesitate to contact us if you have any questions, or would like to request other data sets.
Happy data diving!
opendata@philasd.org
Individual files:
README_SDP_EMPLOYEE.TXT - You're soaking in it!
employee_information.csv
Provides basic information on every employee of the School District of Philadelphia.
LAST_NAME - Employee's last name.
FIRST_NAME - Employee's first name.
PAY_RATE_TYPE - The pay rate type for the employee, either SALARIED, DAILY or HOURLY.
PAY_RATE - The pay rate for the employee. If the pay rate type is salaried, then the annual salary (minus any benefits). If the pay rate type is daily, then the daily pay, and if pay rate type is hourly, then the hourly pay rate. If the pay rate is zero the employee is typically in a leave status that does not receive pay; for example military leave.
TITLE_DESCRIPTION - Title for employee; there are over 500 unique titles.
HOME_ORGANIZATION - The home organization code identifies where the employee is primarily stationed. Some employees are "per diem", meaning they work on a daily basis at a variety of locations; other employees may work at multiple locations, but are attributed to one location.
HOME_ORGANIZATION_DESCRIPTION - The home organization description identifies the home location by name, typically the name of the school or office.
ORGANIZATION_LEVEL - Identifies the type of location the employee works at; for example, administrative office, garage, elementary school, etc.
TYPE_OF_REPRESENTATION - The union representation for the employee. NON REP means the employee is not represented by any union.
GENDER - Gender of employee.
RUN_DATE - The date the data was extracted from SDP data systems.
School District Data Sets Terms of Use.pdf
Terms of use for use the open data sets from the School District of Philadelphia.
readLines is a function used to read text files into R. It returns every line of the file as one element of a character vector. writeLines is typically used for creating text files, but if we don’t give it a file to write to, it prints the output to the console.
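A minimal illustration of the pair, independent of our data files (using a throwaway temporary file):

```r
txt <- c("first line", "second line")
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp)        # with a file argument: creates a text file
readLines(tmp)              # each line comes back as one vector element
## [1] "first line" "second line"
writeLines(readLines(tmp))  # no file argument: prints to the console
```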
The file describes all three of the files that came with the download; the important one is employee_information.csv. We see here that the columns are all described, which will help us understand what we see when we load the data. Let’s do it!
salaries <- fread("./data/employee_information.csv")
print(salaries, class = TRUE)
## LAST_NAME FIRST_NAME PAY_RATE_TYPE PAY_RATE
## <char> <char> <char> <int>
## 1: AARON ANDREA SALARIED 31261
## 2: AARON PEGGY SALARIED 9349
## 3: ABARY RODNEY SALARIED 76461
## 4: ABATE JO-ANN HOURLY 48
## ---
## 17558: ZWICK KEVIN SALARIED 46694
## 17559: ZWIRN RHONDA SALARIED 76461
## 17560: ZWOLAK PAUL SALARIED 54364
## 17561: ZYCHOWICZ LISA SALARIED 67706
## 17562: ZYGMUND-FELT ALLYN SALARIED 90051
## TITLE_DESCRIPTION HOME_ORGANIZATION
## <char> <char>
## 1: GENERAL CLEANER, 8 HOURS 4300
## 2: STUDENT CLIMATE STAFF,4 HOURS 6360
## 3: SCHOOL NURSE 2370
## 4: TEACHER-EXTRA CURR/STAFF DEVEL 9EW0
## ---
## 17558: TEACHER,FULL TIME 8300
## 17559: TEACHER,SPEC EDUCATION 6210
## 17560: TEACHER,FULL TIME 2290
## 17561: TEACHER,FULL TIME 8420
## 17562: TEACHER,FULL TIME 7100
## HOME_ORGANIZATION_DESCRIPTION ORGANIZATION_LEVEL
## <char> <char>
## 1: HESTON, EDWARD SCHOOL ELEMENTARY SCHOOL
## 2: ROOSEVELT ELEMENTARY SCHOOL ELEMENTARY SCHOOL
## 3: MCDANIEL, DELAPLAINE SCHOOL ELEMENTARY SCHOOL
## 4: NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
## ---
## 17558: MAYFAIR SCHOOL ELEMENTARY SCHOOL
## 17559: EDMONDS, FRANKLIN S. SCHOOL ELEMENTARY SCHOOL
## 17560: FRANKLIN LEARNING CENTER HIGH SCHOOL
## 17561: DECATUR, STEPHEN SCHOOL ELEMENTARY SCHOOL
## 17562: COOKE, JAY ELEMENTARY SCHOOL ELEMENTARY SCHOOL
## TYPE_OF_REPRESENTATION GENDER RUN_DATE
## <char> <char> <char>
## 1: LOCAL 1201 F 4/1/2016
## 2: LOCAL 634 F 4/1/2016
## 3: PFT-TEACHER M 4/1/2016
## 4: PFT-TEACHER F 4/1/2016
## ---
## 17558: PFT-TEACHER M 4/1/2016
## 17559: PFT-TEACHER F 4/1/2016
## 17560: PFT-TEACHER M 4/1/2016
## 17561: PFT-TEACHER F 4/1/2016
## 17562: PFT-TEACHER F 4/1/2016
#it's a pain to have to lean on the SHIFT key all the
# time, so let's rename the columns so that they
# are in lower case. We need three functions:
# tolower, which takes all upper-case letters in a
# string and replaces them with their lower-case
# counterpart; names, which returns the names of
# any object (vector, list, data.table, etc.);
# and setnames, which has three arguments:
# 1) a data.table, 2) the columns to rename, and
# 3) their replacements. If we only use two arguments,
# R assumes we want to replace _all_ column names, and
# that the second argument is the replacement values.
setnames(salaries, tolower(names(salaries)))
#checking the result:
salaries
## last_name first_name pay_rate_type pay_rate
## 1: AARON ANDREA SALARIED 31261
## 2: AARON PEGGY SALARIED 9349
## 3: ABARY RODNEY SALARIED 76461
## 4: ABATE JO-ANN HOURLY 48
## 5: ABAYOMI-IGE OLABIMPE SALARIED 76461
## ---
## 17558: ZWICK KEVIN SALARIED 46694
## 17559: ZWIRN RHONDA SALARIED 76461
## 17560: ZWOLAK PAUL SALARIED 54364
## 17561: ZYCHOWICZ LISA SALARIED 67706
## 17562: ZYGMUND-FELT ALLYN SALARIED 90051
## title_description home_organization
## 1: GENERAL CLEANER, 8 HOURS 4300
## 2: STUDENT CLIMATE STAFF,4 HOURS 6360
## 3: SCHOOL NURSE 2370
## 4: TEACHER-EXTRA CURR/STAFF DEVEL 9EW0
## 5: TEACHER,SPEC EDUCATION 6100
## ---
## 17558: TEACHER,FULL TIME 8300
## 17559: TEACHER,SPEC EDUCATION 6210
## 17560: TEACHER,FULL TIME 2290
## 17561: TEACHER,FULL TIME 8420
## 17562: TEACHER,FULL TIME 7100
## home_organization_description organization_level
## 1: HESTON, EDWARD SCHOOL ELEMENTARY SCHOOL
## 2: ROOSEVELT ELEMENTARY SCHOOL ELEMENTARY SCHOOL
## 3: MCDANIEL, DELAPLAINE SCHOOL ELEMENTARY SCHOOL
## 4: NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
## 5: LEEDS, MORRIS E. MIDDLE SCHOOL MIDDLE SCHOOL
## ---
## 17558: MAYFAIR SCHOOL ELEMENTARY SCHOOL
## 17559: EDMONDS, FRANKLIN S. SCHOOL ELEMENTARY SCHOOL
## 17560: FRANKLIN LEARNING CENTER HIGH SCHOOL
## 17561: DECATUR, STEPHEN SCHOOL ELEMENTARY SCHOOL
## 17562: COOKE, JAY ELEMENTARY SCHOOL ELEMENTARY SCHOOL
## type_of_representation gender run_date
## 1: LOCAL 1201 F 4/1/2016
## 2: LOCAL 634 F 4/1/2016
## 3: PFT-TEACHER M 4/1/2016
## 4: PFT-TEACHER F 4/1/2016
## 5: PFT-TEACHER F 4/1/2016
## ---
## 17558: PFT-TEACHER M 4/1/2016
## 17559: PFT-TEACHER F 4/1/2016
## 17560: PFT-TEACHER M 4/1/2016
## 17561: PFT-TEACHER F 4/1/2016
## 17562: PFT-TEACHER F 4/1/2016
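The comments above also describe the three-argument form of setnames; here’s a sketch of renaming a single column (a hypothetical rename, purely for illustration, which we immediately undo so the rest of the section works unchanged):

```r
# old name(s) in the second argument, new name(s) in the third;
# all other columns are left untouched
setnames(salaries, "pay_rate", "annual_pay")  # hypothetical rename
names(salaries)[1:4]
setnames(salaries, "annual_pay", "pay_rate")  # change it back
```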
This is a much bigger data set – covering all employees of the SDP and giving many more details about them. Let’s explore a bit:
#table is a great function for describing discrete variables.
# it gives the count of observations in each cell, and can also
# be used to create what Stata calls two-way tables.
salaries[ , table(pay_rate_type)]
## pay_rate_type
## DAILY HOURLY SALARIED
## 141 526 16895
salaries[ , table(organization_level)]
## organization_level
## ACADEMY ADMINISTRATIVE OFFICE
## 4 1578
## CAREER AND TECHNICAL HIGH SCHL CHARTER SCHOOLS
## 456 12
## DISTRICT BUILDING/PROPERTY EARLY CHILDHOOD
## 64 356
## ELEMENTARY SCHOOL GARAGE
## 9667 6
## HIGH SCHOOL INACTIVE WITH ACTIVITY
## 3101 11
## LEARNING NETWORK MIDDLE SCHOOL
## 20 954
## NON ADMINISTRATIVE OFFICE NON PUBLIC SCHOOL
## 1191 1
## TRANSITION / OVERAGE SCHOOL
## 141
#a cross-tabulation of these two; the first argument will
# appear in the rows, the second in the columns
salaries[ , table(pay_rate_type, organization_level)]
## organization_level
## pay_rate_type ACADEMY ADMINISTRATIVE OFFICE CAREER AND TECHNICAL HIGH SCHL
## DAILY 2 31 4
## HOURLY 0 147 0
## SALARIED 2 1400 452
## organization_level
## pay_rate_type CHARTER SCHOOLS DISTRICT BUILDING/PROPERTY EARLY CHILDHOOD
## DAILY 0 0 0
## HOURLY 0 0 2
## SALARIED 12 64 354
## organization_level
## pay_rate_type ELEMENTARY SCHOOL GARAGE HIGH SCHOOL INACTIVE WITH ACTIVITY
## DAILY 23 0 11 0
## HOURLY 0 0 2 0
## SALARIED 9644 6 3088 11
## organization_level
## pay_rate_type LEARNING NETWORK MIDDLE SCHOOL NON ADMINISTRATIVE OFFICE
## DAILY 2 1 65
## HOURLY 0 0 375
## SALARIED 18 953 751
## organization_level
## pay_rate_type NON PUBLIC SCHOOL TRANSITION / OVERAGE SCHOOL
## DAILY 0 2
## HOURLY 0 0
## SALARIED 1 139
salaries[ , table(organization_level, gender)]
## gender
## organization_level F M
## ACADEMY 1 3
## ADMINISTRATIVE OFFICE 911 667
## CAREER AND TECHNICAL HIGH SCHL 269 187
## CHARTER SCHOOLS 11 1
## DISTRICT BUILDING/PROPERTY 26 38
## EARLY CHILDHOOD 343 13
## ELEMENTARY SCHOOL 7989 1678
## GARAGE 0 6
## HIGH SCHOOL 1988 1113
## INACTIVE WITH ACTIVITY 5 6
## LEARNING NETWORK 7 13
## MIDDLE SCHOOL 706 248
## NON ADMINISTRATIVE OFFICE 597 594
## NON PUBLIC SCHOOL 1 0
## TRANSITION / OVERAGE SCHOOL 94 47
Let’s say we’re only interested in salaried employees. How do we get rid of the hourly/daily employees?
Unlike in Stata, where we’d use drop and would have to jump through hoops to recover the lost observations (in case we decided to expand our focus later in our analysis), we can simply create a new data.table in R that contains only the observations we care about. This is, in my opinion, one of the most salient examples of a feature of R that blows Stata out of the water. We can keep many data sets at our disposal at once, without having to unload/reload the ones we’re not currently using. This will prove especially helpful when it comes to data cleaning/merging/reshaping.
salaried <- salaries[pay_rate_type == "SALARIED"]
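Note that salaries itself is untouched – both tables now sit side by side in memory (the counts below come from the print preview and the table() output above):

```r
nrow(salaries)  # still all 17,562 employees
nrow(salaried)  # the 16,895 SALARIED employees
```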
Even more useful may be to exclude anyone who’s not a teacher. But first: how do we figure out who’s a teacher?
Approach:
- Count the number of observations in each category of title_description (we can see from the preview that there are at least some teachers under TEACHER,FULL TIME).
- Sort these categories to find the most common ones, since presumably teachers are by far the most common SDP employees.
Carrying this out will introduce a few new things:
- Grouping. We’re going to group by title_description and count the observations in each group. If I remember correctly, in Stata this would be like by title_description: count. When data.table does this, you should think of your data.table as being split into a whole bunch of smaller data.tables, each of which has all of the observations for exactly one level of title_description.
- .N. data.table uses the abbreviation .N to stand for “number of observations”. It is defined locally, within each group, so each of the many data.tables mentioned above has its own .N. This is basically like count in Stata.
- order. We can arrange the rows of a table by using order in the first argument (i); use a minus sign (-) for decreasing order, otherwise it’s increasing (like we saw with sort).
- Chaining. We can tack on more and more [] to any given data.table operation to hone, continue, or adjust our outcomes beyond what we could do in a single call to []. The general syntax is DT[...][...][...][...], where each [...] consists of some manipulation/subset/aggregation, etc.
#Let's take it step-by-step
# 1: Find the count of employees in each category
salaried[ , .N, by = title_description]
## title_description N
## 1: GENERAL CLEANER, 8 HOURS 503
## 2: STUDENT CLIMATE STAFF,4 HOURS 451
## 3: SCHOOL NURSE 184
## 4: TEACHER,SPEC EDUCATION 1314
## 5: STUDENT CLIMATE STAFF,3 HOURS 410
## ---
## 531: DATA ANALYST - CHARTER SCHOOLS 1
## 532: DIR,OPERATIONAL SYS DEV 1
## 533: DIR,SYSTEMS ADMIN UNIT 1
## 534: DIR,INSURANCE RISK MANAGEMENT 1
## 535: DIR,DISTRICT PERFORMANCE OFF 1
# 2: reorder these to find the most common
salaried[ , .N, by = title_description][order(N)]
## title_description N
## 1: SCHOOL-BASED TECH MAINT ASST 1
## 2: DIR,CERT,SUB SVCS,SCH ALLOT SU 1
## 3: NUTRITIONIST, PKHS 1
## 4: ADMINISTRATOR,PHILA VIRTUAL AC 1
## 5: HR SYSTEMS CONTROL ANALYST 1
## ---
## 531: GENERAL CLEANER, 8 HOURS 503
## 532: CLASSROOM ASST,SP ED,SV HND 594
## 533: ONE TO ONE ASST, SPECIAL ED 757
## 534: TEACHER,SPEC EDUCATION 1314
## 535: TEACHER,FULL TIME 6652
# 3: put it in decreasing order
salaried[ , .N, by = title_description][order(-N)]
## title_description N
## 1: TEACHER,FULL TIME 6652
## 2: TEACHER,SPEC EDUCATION 1314
## 3: ONE TO ONE ASST, SPECIAL ED 757
## 4: CLASSROOM ASST,SP ED,SV HND 594
## 5: GENERAL CLEANER, 8 HOURS 503
## ---
## 531: DATA ANALYST - CHARTER SCHOOLS 1
## 532: DIR,OPERATIONAL SYS DEV 1
## 533: DIR,SYSTEMS ADMIN UNIT 1
## 534: DIR,INSURANCE RISK MANAGEMENT 1
## 535: DIR,DISTRICT PERFORMANCE OFF 1
# 4: Only look at the top 20 most frequent positions
# (idea -- once it's resorted, these are simply
# the first twenty rows of the sorted table)
salaried[ , .N, by = title_description][order(-N)][1:20]
## title_description N
## 1: TEACHER,FULL TIME 6652
## 2: TEACHER,SPEC EDUCATION 1314
## 3: ONE TO ONE ASST, SPECIAL ED 757
## 4: CLASSROOM ASST,SP ED,SV HND 594
## 5: GENERAL CLEANER, 8 HOURS 503
## 6: STUDENT CLIMATE STAFF,4 HOURS 451
## 7: STUDENT CLIMATE STAFF,3 HOURS 410
## 8: FOOD SVCS ASSISTANT 386
## 9: SUPPORTIVE SERVICES ASST, 4 HR 317
## 10: BUS ATTENDANT 295
## 11: SCHOOL POLICE OFFICER 264
## 12: SUPPORTIVE SERVICES ASST, 3 HR 249
## 13: SCHOOL COUNSELOR, 10 MONTHS 245
## 14: CUSTODIAL ASSISTANT 241
## 15: PRINCIPAL 225
## 16: SECRETARY I 224
## 17: SCHOOL NURSE 184
## 18: FOOD SVCS WORKER SENIOR 163
## 19: STUDENT CLIMATE STAFF,5 HOURS 145
## 20: BUS CHAUFFEUR PT (4-5HRS/DAY) 128
We can start to see many interesting facets of employment at SDP here: more than 260 school police officers, but only 184 nurses. Ripe ground for exploring on your own!
I’ll continue focusing on teachers; it appears all but the special ed. teachers are grouped under TEACHER,FULL TIME, so we can subset:
teachers <- salaried[title_description == "TEACHER,FULL TIME"]
teachers
## last_name first_name pay_rate_type pay_rate title_description
## 1: ABBOTT JOYCE SALARIED 76461 TEACHER,FULL TIME
## 2: ABDALLAH JUWAYRIYAH SALARIED 46694 TEACHER,FULL TIME
## 3: ABDEL-JALIL GHADEER SALARIED 45359 TEACHER,FULL TIME
## 4: ABDUL BASIT BARBARA SALARIED 67706 TEACHER,FULL TIME
## 5: ABDUL-LATEEF VILLIA SALARIED 48945 TEACHER,FULL TIME
## ---
## 6648: ZUNGOLO WILLIAM SALARIED 90051 TEACHER,FULL TIME
## 6649: ZWICK KEVIN SALARIED 46694 TEACHER,FULL TIME
## 6650: ZWOLAK PAUL SALARIED 54364 TEACHER,FULL TIME
## 6651: ZYCHOWICZ LISA SALARIED 67706 TEACHER,FULL TIME
## 6652: ZYGMUND-FELT ALLYN SALARIED 90051 TEACHER,FULL TIME
## home_organization home_organization_description organization_level
## 1: 1290 HAMILTON, ANDREW SCHOOL ELEMENTARY SCHOOL
## 2: 1470 LOCKE, ALAIN SCHOOL ELEMENTARY SCHOOL
## 3: 7440 TAYLOR, BAYARD SCHOOL ELEMENTARY SCHOOL
## 4: 7530 ROWEN, WILLIAM SCHOOL ELEMENTARY SCHOOL
## 5: 1010 BARTRAM, JOHN HIGH SCHOOL HIGH SCHOOL
## ---
## 6648: 8390 FITZPATRICK, A. L. SCHOOL ELEMENTARY SCHOOL
## 6649: 8300 MAYFAIR SCHOOL ELEMENTARY SCHOOL
## 6650: 2290 FRANKLIN LEARNING CENTER HIGH SCHOOL
## 6651: 8420 DECATUR, STEPHEN SCHOOL ELEMENTARY SCHOOL
## 6652: 7100 COOKE, JAY ELEMENTARY SCHOOL ELEMENTARY SCHOOL
## type_of_representation gender run_date
## 1: PFT-TEACHER F 4/1/2016
## 2: PFT-TEACHER F 4/1/2016
## 3: PFT-TEACHER F 4/1/2016
## 4: PFT-TEACHER F 4/1/2016
## 5: PFT-TEACHER F 4/1/2016
## ---
## 6648: PFT-TEACHER M 4/1/2016
## 6649: PFT-TEACHER M 4/1/2016
## 6650: PFT-TEACHER M 4/1/2016
## 6651: PFT-TEACHER F 4/1/2016
## 6652: PFT-TEACHER F 4/1/2016
Group mean salaries – various cuts
Now we can explore data by group. How does pay differ by gender?
#simple to code!
teachers[ , mean(pay_rate), by = gender]
## gender V1
## 1: F 68624.85
## 2: M 67689.59
#the mean column automatically got named V1. If we want
# friendlier output, we have to change things a little:
teachers[ , .(avg_wage = mean(pay_rate)), by = gender]
## gender avg_wage
## 1: F 68624.85
## 2: M 67689.59
So in teaching in Philly, women actually earn significantly more than men (this is indeed statistically significant; I’ll show you how to test that in a bit)!
(Of course, men and women with the same experience and certification are paid basically the same – it’s written explicitly in the teachers’ contract – but we don’t have any experience or certification measures to control for this.)
How does pay differ by organization level?
teachers[ , .(avg_wage = mean(pay_rate)), by = organization_level]
## organization_level avg_wage
## 1: ELEMENTARY SCHOOL 67639.16
## 2: HIGH SCHOOL 69527.33
## 3: MIDDLE SCHOOL 69481.66
## 4: CAREER AND TECHNICAL HIGH SCHL 71931.52
## 5: EARLY CHILDHOOD 74058.15
## 6: TRANSITION / OVERAGE SCHOOL 63434.37
## 7: ADMINISTRATIVE OFFICE 66255.22
## 8: NON ADMINISTRATIVE OFFICE 82723.82
## 9: ACADEMY 67789.00
Which school has the best-paid teachers?
teachers[ , .(avg_wage = mean(pay_rate)),
by = home_organization_description
][order(-avg_wage)]
## home_organization_description avg_wage
## 1: LOGAN SCHOOL HEAD START 90051.00
## 2: COOK-WISSAHICKON HEAD START 90051.00
## 3: SHAWMONT BRIGHT FUTURES 90051.00
## 4: MUNOZ-MARIN HEAD START 90051.00
## 5: SHARSWOOD HEAD START 90051.00
## ---
## 287: DUNBAR, PAUL L. SCHOOL 56805.95
## 288: SOUTH PHILA HEAD START 51113.00
## 289: BLAINE HEAD START 51113.00
## 290: BETHUNE HEAD START 51113.00
## 291: MIFFLIN HEAD START 0.00
This looks fishy! It’s rare for any average worth knowing to come out to an exact integer. I suspect there are very few teachers at these schools. How can we check?
#remember our handy friend .N!!
teachers[ , .(.N, avg_wage = mean(pay_rate)),
by = home_organization_description
][order(-avg_wage)]
## home_organization_description N avg_wage
## 1: LOGAN SCHOOL HEAD START 1 90051.00
## 2: COOK-WISSAHICKON HEAD START 1 90051.00
## 3: SHAWMONT BRIGHT FUTURES 1 90051.00
## 4: MUNOZ-MARIN HEAD START 1 90051.00
## 5: SHARSWOOD HEAD START 1 90051.00
## ---
## 287: DUNBAR, PAUL L. SCHOOL 21 56805.95
## 288: SOUTH PHILA HEAD START 1 51113.00
## 289: BLAINE HEAD START 1 51113.00
## 290: BETHUNE HEAD START 1 51113.00
## 291: MIFFLIN HEAD START 1 0.00
In fact, almost all of the “best-” and “worst-paid” schools are outliers because there’s only one employee listed, so that’s not really a fair comparison.
So what can we do to improve our measure?
What’s typically done is to exclude all schools that don’t pass some cutoff minimum number of employees. For this exercise, let’s exclude all schools that have fewer than 10 full-time teachers in the data.
The way to do so is pretty natural. Remember that when we use by, we’re essentially splitting our main data.table up into 291 smaller data.tables, one for each school. We want to keep only those sub-data.tables that have at least 10 teachers – i.e., they have at least 10 rows, i.e., .N is at least 10:
#No longer keeping .N as an output -- this is just
# for focus, since the filter guarantees every school
# that remains has at least ten teachers
teachers[ , if (.N >= 10) .(avg_wage = mean(pay_rate)),
by = home_organization_description
][order(-avg_wage)]
## home_organization_description avg_wage
## 1: NON-PUBLIC PROGRAMS 82723.82
## 2: SOUTH PHILADELPHIA H.S. 78295.27
## 3: GIRLS, PHILA HIGH SCHOOL FOR 77710.83
## 4: POWEL, SAMUEL SCHOOL 77700.50
## 5: PENN TREATY HIGH SCHOOL 76291.82
## ---
## 208: KENSINGTON CAPA 59277.92
## 209: HUEY, SAMUEL B. SCHOOL 58582.12
## 210: PEIRCE, THOMAS M. SCHOOL 58114.21
## 211: CLEMENTE, ROBERTO MIDDLE SCHL 57635.85
## 212: DUNBAR, PAUL L. SCHOOL 56805.95
Here we see our first if statement in R. The basic form of an if statement in R is:
#Note the parentheses around the logical!! Very easy to forget
if (some_logical){
# what to do if true
} else {
# what to do if false
}
It’s important to note that some_logical has to be either TRUE or FALSE – it can’t be a vector!! This sounds easy now but it’s guaranteed to trip you up at some point.
There’s no else statement in the condition we used above. So, for all the schools where there are fewer than 10 teachers, nothing happens (technically, the statement returns NULL – but don’t worry about that for now), and data.table ignores any such sub-data.table.
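The scalar-vs-vector distinction is worth pinning down with a quick base-R sketch (the variable names here are just for illustration):

```r
x <- c(2, 5, 8)

# if() needs a single TRUE/FALSE -- collapse a vector with any() or all()
if (any(x > 6)) {
  big_found <- TRUE
} else {
  big_found <- FALSE
}

# for an element-wise choice, use the vectorized ifelse() instead
labels <- ifelse(x > 4, "big", "small")
```

Passing the raw vector `x > 6` to `if` is exactly the mistake that trips people up, since `if` only looks at (or, in recent R versions, only accepts) a length-one logical.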
I don’t really know what NON-PUBLIC PROGRAMS means… but besides that, we see a lot of staple high school names. Of course, all teachers in Philly are on the same teachers’ contract and hence the same pay scale – so what this exercise has really told us is that South Philly high has the most experienced and/or certified teachers, and that teachers at Dunbar are young and/or uncertified.
We could go on exploring and slicing & dicing this data all day, but we’ve covered enough essentials of manipulation and inspection for now. We’ll pick up more tricks as we move along.
Merging
Recall our teacher attendance data:
attendance
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
We couldn’t really tell anything about the schools because we didn’t know what the school codes meant.
Well, it appears we’ve found the Rosetta Stone in the teacher salary data. The home_organization field looks very much like the ULCS_NO field in the attendance data.
Ideally, we’d know for sure that they represent the same ID for each school. Short of that, we can bolster our confidence in the alignment by comparing the fields a bit. Here are some techniques:
#We know from attendance there are 212 schools.
# How many unique schools are represented in the teacher data?
teachers[ , uniqueN(home_organization)]
## [1] 291
#How many of the ULCS_NO values are also found in home_organization?
# intersect finds the middle of the Venn diagram for its
# first and second arguments
length(intersect(attendance$ULCS_NO, teachers[ , unique(home_organization)]))
## [1] 211
That’s pretty convincing if you ask me.
So, how do we match the two data sets? In Stata, depending on which data set we currently had in memory, we would either perform a one-to-many or a many-to-one merge.
In R, merging either way is a cinch. But first, we need a few more tools.
:= - Adding a column to a data.table
When we want to add a new column to an existing data.table, we use the := operator in j (the second argument). As a quick aside, this is done by reference, which means it’s memory-efficient and fast; see this Q&A for more details.
Let’s try it out with two examples. Right now, in attendance, teacher attendance rates are listed as a percentage (out of 100); sometimes, it’s more convenient to have percentages as a proportion (out of one).
Also, we might like to know the relative attendance rates – the ratio of attendance at a given school to average attendance.
In teachers, gender is stored as a character – either M or F. But it’s often quite convenient to have binary variables stored as logicals, which makes subsetting easier.
#add attendance as a proportion
attendance[ , attendance_proportion := SCH_TEACHER_ATTEND / 100]
#add relative attendance
attendance[ , attendance_relative := SCH_TEACHER_ATTEND / SDP_TEACHER_ATTEND]
#a quirk of data.table is that after using
# :=, we sometimes have to force printing,
# which we can do by appending [] to the end
attendance[]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
#add logical version of gender
teachers[ , male := gender == "M"]
#now, subset to show only male teachers
teachers[(male)]
## last_name first_name pay_rate_type pay_rate title_description
## 1: ABDULALEEM MUHAMMAD SALARIED 65121 TEACHER,FULL TIME
## 2: ABDULLAH RASHEED SALARIED 83382 TEACHER,FULL TIME
## 3: ACKERMAN CARL SALARIED 83382 TEACHER,FULL TIME
## 4: ADAIR DARRELL SALARIED 66461 TEACHER,FULL TIME
## 5: ADAIR JOSEPH SALARIED 51113 TEACHER,FULL TIME
## ---
## 1822: ZUCKER JEFFREY SALARIED 73453 TEACHER,FULL TIME
## 1823: ZUFOLO BRIAN SALARIED 57451 TEACHER,FULL TIME
## 1824: ZUNGOLO WILLIAM SALARIED 90051 TEACHER,FULL TIME
## 1825: ZWICK KEVIN SALARIED 46694 TEACHER,FULL TIME
## 1826: ZWOLAK PAUL SALARIED 54364 TEACHER,FULL TIME
## home_organization home_organization_description
## 1: 6090 RANDOLPH TECHNICAL HIGH SCHOOL
## 2: 4220 BLAINE, JAMES G. SCHOOL
## 3: 2670 CONSTITUTION HIGH SCHOOL
## 4: 2010 FRANKLIN, BENJAMIN HIGH SCHOOL
## 5: 5400 RICHMOND SCHOOL
## ---
## 1822: 7310 FELTONVILLE INTERMEDIATE
## 1823: 5330 HUNTER, WILLIAM H. SCHOOL
## 1824: 8390 FITZPATRICK, A. L. SCHOOL
## 1825: 8300 MAYFAIR SCHOOL
## 1826: 2290 FRANKLIN LEARNING CENTER
## organization_level type_of_representation gender
## 1: CAREER AND TECHNICAL HIGH SCHL PFT-TEACHER M
## 2: ELEMENTARY SCHOOL PFT-TEACHER M
## 3: HIGH SCHOOL PFT-TEACHER M
## 4: HIGH SCHOOL PFT-TEACHER M
## 5: ELEMENTARY SCHOOL PFT-TEACHER M
## ---
## 1822: ELEMENTARY SCHOOL PFT-TEACHER M
## 1823: ELEMENTARY SCHOOL PFT-TEACHER M
## 1824: ELEMENTARY SCHOOL PFT-TEACHER M
## 1825: ELEMENTARY SCHOOL PFT-TEACHER M
## 1826: HIGH SCHOOL PFT-TEACHER M
## run_date male
## 1: 4/1/2016 TRUE
## 2: 4/1/2016 TRUE
## 3: 4/1/2016 TRUE
## 4: 4/1/2016 TRUE
## 5: 4/1/2016 TRUE
## ---
## 1822: 4/1/2016 TRUE
## 1823: 4/1/2016 TRUE
## 1824: 4/1/2016 TRUE
## 1825: 4/1/2016 TRUE
## 1826: 4/1/2016 TRUE
Merge and update
Now we’re equipped to understand the syntax for a merge. Let’s add the school’s name (home_organization_description in teachers) to the attendance data.
First, note that teachers has more data in it than we need. There are many teachers in each school, meaning the mapping between home_organization code and school name is repeated many times (once for each teacher in a given school). This presents a bit of an issue when we try to merge the teachers data into attendance – for each ULCS_NO that gets matched to a home_organization in teachers, there will be many rows, even if they all have the same information. To deal with this we’ll use unique to eliminate all the duplicates. We’ll go step-by-step first, then all at once:
#this will take the FIRST row associated
# with each school. This is important in
# general, but not here since we only
# care about the school-level data for now,
# which is the same in all rows.
schools <- unique(teachers, by = "home_organization")
schools
## last_name first_name pay_rate_type pay_rate title_description
## 1: ABBOTT JOYCE SALARIED 76461 TEACHER,FULL TIME
## 2: ABDALLAH JUWAYRIYAH SALARIED 46694 TEACHER,FULL TIME
## 3: ABDEL-JALIL GHADEER SALARIED 45359 TEACHER,FULL TIME
## 4: ABDUL BASIT BARBARA SALARIED 67706 TEACHER,FULL TIME
## 5: ABDUL-LATEEF VILLIA SALARIED 48945 TEACHER,FULL TIME
## ---
## 287: VALENTIN MADELINE SALARIED 65121 TEACHER,FULL TIME
## 288: WERTZ YORI DESIREE SALARIED 76461 TEACHER,FULL TIME
## 289: WIDMAIER DONNA SALARIED 76461 TEACHER,FULL TIME
## 290: WILLIAMS TAMMAR SALARIED 76461 TEACHER,FULL TIME
## 291: WRIGHT AMY SALARIED 67789 TEACHER,FULL TIME
## home_organization home_organization_description organization_level
## 1: 1290 HAMILTON, ANDREW SCHOOL ELEMENTARY SCHOOL
## 2: 1470 LOCKE, ALAIN SCHOOL ELEMENTARY SCHOOL
## 3: 7440 TAYLOR, BAYARD SCHOOL ELEMENTARY SCHOOL
## 4: 7530 ROWEN, WILLIAM SCHOOL ELEMENTARY SCHOOL
## 5: 1010 BARTRAM, JOHN HIGH SCHOOL HIGH SCHOOL
## ---
## 287: 7262 ELLWOOD SCHOOL HEAD START EARLY CHILDHOOD
## 288: 8352 SPRUANCE SCHOOL HEAD START EARLY CHILDHOOD
## 289: 6202 DAY, ANNA B. HEAD START EARLY CHILDHOOD
## 290: 1027 LEA SCHOOL HEAD START EARLY CHILDHOOD
## 291: 2242 BREGY HEAD START EARLY CHILDHOOD
## type_of_representation gender run_date male
## 1: PFT-TEACHER F 4/1/2016 FALSE
## 2: PFT-TEACHER F 4/1/2016 FALSE
## 3: PFT-TEACHER F 4/1/2016 FALSE
## 4: PFT-TEACHER F 4/1/2016 FALSE
## 5: PFT-TEACHER F 4/1/2016 FALSE
## ---
## 287: PFT- PRE K F 4/1/2016 FALSE
## 288: PFT- PRE K F 4/1/2016 FALSE
## 289: PFT- PRE K F 4/1/2016 FALSE
## 290: PFT- PRE K F 4/1/2016 FALSE
## 291: PFT- PRE K F 4/1/2016 FALSE
#now merge, using := to add the school name column
attendance[schools, school_name :=
home_organization_description,
on = c(ULCS_NO = "home_organization")]
attendance
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
## school_name
## 1: SPRUANCE, GILBERT SCHOOL
## 2: HOWE, JULIA WARD SCHOOL
## 3: WILLARD, FRANCES E. SCHOOL
## 4: LABRUM,GEN HARRY MIDDLE SCHOOL
## 5: LINGELBACH, ANNA L. SCHOOL
## ---
## 208: KEARNY, GEN. PHILIP SCHOOL
## 209: HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211: ROBESON, PAUL HIGH SCHOOL
## 212: HESTON, EDWARD SCHOOL
The general form of the syntax we just used is X[Y, variable_in_x := variable_in_y, on = c(a = "b")]. Let’s break that down a bit:
X and Y are data.tables. We match their rows according to alignment in the variables specified by the on argument within the square brackets ([]) used to access X.
on = c(a = "b") tells R that rows with a given value in column a of X (both on the left) should be matched with rows having the same value in column b of Y.
:= adds the variable on its left to data.table X. The variable on the right of := is typically found in Y, though it can also be an expression involving columns from both X and Y.
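Here’s a toy version of that pattern, with made-up tables X and Y (assuming data.table is loaded):

```r
library(data.table)

X <- data.table(a = c(1, 2, 3), val = c(10, 20, 30))
Y <- data.table(b = c(2, 3), label = c("two", "three"))

# match X$a to Y$b and pull label across; unmatched rows of X get NA
X[Y, lbl := i.label, on = c(a = "b")]
X[]
```

The row of X with a = 1 has no partner in Y, so its new lbl column comes out NA – the same thing we’ll run into shortly with our school merge.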
Let’s explore a few other ways we could have gone about doing this merge. First, let’s remove the column that we just created so that we’re really starting from scratch. The way to remove columns from a data.table is to set them to NULL, which is very fast.
attendance[ , school_name := NULL][]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
#alternative 1: skip defining schools
# extra tidbit: sometimes, both the
# main table (x) and the merging table
# (in the first argument, i -- we called
# it Y above) have columns with the same
# name. To overcome the ambiguity this
# engenders, we can prepend i. to the
# column names in teachers to tell R
# that we're referring _explicitly_
# to that column in teachers. Similarly,
# we can prepend x. to tell R that we're
# referring to the column in attendance
attendance[unique(teachers, by = "home_organization"),
school_name := i.home_organization_description,
on = c(ULCS_NO = "home_organization")][]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
## school_name
## 1: SPRUANCE, GILBERT SCHOOL
## 2: HOWE, JULIA WARD SCHOOL
## 3: WILLARD, FRANCES E. SCHOOL
## 4: LABRUM,GEN HARRY MIDDLE SCHOOL
## 5: LINGELBACH, ANNA L. SCHOOL
## ---
## 208: KEARNY, GEN. PHILIP SCHOOL
## 209: HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211: ROBESON, PAUL HIGH SCHOOL
## 212: HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]
#alternative 2: use unique within the merge
# by = .EACHI is a syntax unique to merging.
# Using this tells R to do the thing in j
# within each group that's matched.
# So, here, each ULCS_NO gets matched to many
# home_organization rows in teachers.
# Within each such group, we use unique to
# collapse the vector of school names (one per
# teacher in the matched school) into a single
# value.
attendance[teachers,
school_name := unique(home_organization_description),
on = c(ULCS_NO = "home_organization"), by = .EACHI][]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
## school_name
## 1: SPRUANCE, GILBERT SCHOOL
## 2: HOWE, JULIA WARD SCHOOL
## 3: WILLARD, FRANCES E. SCHOOL
## 4: LABRUM,GEN HARRY MIDDLE SCHOOL
## 5: LINGELBACH, ANNA L. SCHOOL
## ---
## 208: KEARNY, GEN. PHILIP SCHOOL
## 209: HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211: ROBESON, PAUL HIGH SCHOOL
## 212: HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]
#alternative 3: like 2, except use [] instead of unique.
# This alternative works, and is faster, but is less robust
# to finding mistakes in your data. Thankfully, this data is
# pretty clean, but we often find messy data in the wild --
# maybe there was a typo in the organization code or school name,
# and instead of each home_organization being matched to
# exactly one home_organization_description, we might find
# it matched to several (e.g., "PS 131" and "PS #131").
# If we use the unique approach from alternative 2,
# we'll get an error, which tells us we need to examine our data
# more closely. This approach will definitely not generate an error.
attendance[teachers,
school_name := home_organization_description[1],
on = c(ULCS_NO = "home_organization"), by = .EACHI][]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 8350 2013-2014 92.7 93.4 835
## 2: 7320 2013-2014 95.0 93.4 732
## 3: 5440 2013-2014 96.0 93.4 544
## 4: 8320 2013-2014 95.1 93.4 832
## 5: 6440 2013-2014 96.6 93.4 644
## ---
## 208: 5480 2013-2014 92.0 93.4 548
## 209: 1300 2013-2014 91.4 93.4 130
## 210: 6090 2013-2014 95.3 93.4 609
## 211: 1050 2013-2014 94.7 93.4 105
## 212: 4300 2013-2014 94.7 93.4 430
## attendance_proportion attendance_relative
## 1: 0.927 0.9925054
## 2: 0.950 1.0171306
## 3: 0.960 1.0278373
## 4: 0.951 1.0182013
## 5: 0.966 1.0342612
## ---
## 208: 0.920 0.9850107
## 209: 0.914 0.9785867
## 210: 0.953 1.0203426
## 211: 0.947 1.0139186
## 212: 0.947 1.0139186
## school_name
## 1: SPRUANCE, GILBERT SCHOOL
## 2: HOWE, JULIA WARD SCHOOL
## 3: WILLARD, FRANCES E. SCHOOL
## 4: LABRUM,GEN HARRY MIDDLE SCHOOL
## 5: LINGELBACH, ANNA L. SCHOOL
## ---
## 208: KEARNY, GEN. PHILIP SCHOOL
## 209: HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211: ROBESON, PAUL HIGH SCHOOL
## 212: HESTON, EDWARD SCHOOL
Which approach you prefer is a matter of taste – each is well-suited to certain situations.
Missing Data
No data set is perfect. In fact, most data sets are nightmares. Missing data is pervasive, so it stands to reason that R is well-equipped to handle missing data.
All missing data in R is coded as NA – provided we’ve been careful when loading the data, usually by telling R what missing values look like in a given file. Take fread, for example. We haven’t needed it yet because the two data sets we’ve loaded have been complete/balanced, but fread has an argument called na.strings that tells R which strings in a data file should be treated as missing (i.e., read as NA).
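For instance, if a file marked its missing values with "N/A" or ".", we might load it like this (the inline text below is a made-up stand-in for a file):

```r
library(data.table)

# inline text standing in for a .csv file with two missing-value codes
txt <- "id,score\n1,95\n2,N/A\n3,."
dt <- fread(txt, na.strings = c("N/A", "."))
dt$score  # the N/A and . entries come in as NA
```

Without the na.strings argument, the score column would be read as character, with "N/A" and "." kept as literal strings.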
We now have some missing data after our merge, however. To see this, note the following:
#is.na returns a logical vector saying
# whether each element of school_name
# is missing (NA) or not
attendance[is.na(school_name)]
## ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1: 2140 2013-2014 95.7 93.4 214
## attendance_proportion attendance_relative school_name
## 1: 0.957 1.024625 NA
#NOTE: TESTING NA WITH EQUALITY DOESN'T WORK
attendance[school_name == NA]
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...
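The reason: comparisons against NA propagate NA rather than returning TRUE or FALSE, so the “match” is never TRUE. A quick base-R illustration:

```r
x <- c(1, NA, 3)

x == NA    # NA NA NA -- every comparison with NA is itself unknown
is.na(x)   # FALSE TRUE FALSE -- the correct test for missingness
```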
So what’s up with ULCS_NO 2140? Why does this school have no teachers in the salary data??
Turns out, it is in the salary data, but none of the teachers are listed as TEACHER,FULL TIME:
#remember, salaries is the full set
# of ALL employees of SDP -- we
# didn't remove any observations from here
#(using the nrows argument for print to condense output)
print(salaries[home_organization == "2140"], nrows = 10)
## last_name first_name pay_rate_type pay_rate
## 1: ABNEY GILDA SALARIED 79586
## 2: AIKENS SARAH SALARIED 48110
## 3: AKES RONALD SALARIED 17963
## 4: AMIT-CUBBAGE DONNA SALARIED 84880
## 5: BALLEW MEGAN SALARIED 52530
## ---
## 79: TARANTA CHRISTOPHER SALARIED 84880
## 80: TAYLOR ELIZABETH SALARIED 84880
## 81: VECSI JANEL SALARIED 77961
## 82: VOLIN AVELIN ALEXANDRA SALARIED 77961
## 83: WILLIAMS LESLIE SALARIED 34164
## title_description home_organization
## 1: SCHOOL COUNSELOR, 10 MONTHS 2140
## 2: FOOD SVCS MANAGER II 2140
## 3: FOOD SVCS WORKER II 2140
## 4: TEACHER,DEMONSTRATION 2140
## 5: TEACHER,DEMONSTRATION 2140
## ---
## 79: TEACHER,DEMONSTRATION 2140
## 80: TEACHER,DEMONSTRATION 2140
## 81: TEACHER,DEMONSTRATION 2140
## 82: TEACHER,DEMONSTRATION 2140
## 83: CUSTODIAL ASSISTANT 2140
## home_organization_description organization_level
## 1: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 2: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 3: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 4: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 5: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## ---
## 79: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 80: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 81: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 82: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## 83: MASTERMAN,JULIA R. HIGH SCHOOL HIGH SCHOOL
## type_of_representation gender run_date
## 1: PFT-TEACHER F 4/1/2016
## 2: PFT-F/S MGR F 4/1/2016
## 3: LOCAL 634 M 4/1/2016
## 4: PFT-TEACHER F 4/1/2016
## 5: PFT-TEACHER F 4/1/2016
## ---
## 79: PFT-TEACHER M 4/1/2016
## 80: PFT-TEACHER F 4/1/2016
## 81: PFT-TEACHER F 4/1/2016
## 82: PFT-TEACHER F 4/1/2016
## 83: LOCAL 1201 F 4/1/2016
#since they're not full-time teachers, what are they?
salaries[home_organization == "2140",
table(title_description)]
## title_description
## ASST PRINCIPAL BUILDING ENGINEER-GROUP III
## 1 1
## CUSTODIAL ASSISTANT FOOD SVCS ASSISTANT
## 2 2
## FOOD SVCS MANAGER II FOOD SVCS UTILITY WORKER
## 1 1
## FOOD SVCS WORKER I FOOD SVCS WORKER II
## 1 1
## GENERAL CLEANER, 8 HOURS ONE TO ONE ASST, SPECIAL ED
## 3 1
## PRINCIPAL SCHOOL COUNSELOR, 10 MONTHS
## 1 3
## SCHOOL NURSE SCHOOL OPERATIONS OFFICER
## 1 1
## SECRETARY I STUDENT CLIMATE STAFF,3 HOURS
## 3 2
## STUDENT CLIMATE STAFF,4 HOURS TEACHER,DEMONSTRATION
## 3 54
## TEACHER,DEMONSTRATION,SPEC ED
## 1
That’s right – at Masterman, none of the teachers appear to be classified as such. They’re classified for wage purposes as TEACHER,DEMONSTRATION. I don’t really know what this means… I was only able to find a small tidbit on the SDP website:
Demonstration Teachers are unique educators. They utilize the most current educational techniques and instructional media, which contribute to effective teaching practices in their specific teaching areas. They are observed at times by personnel from within the School District as well as visitors from other areas and institutions, both national and international.
This appears to be something basically exclusive to Masterman, with a fair presence at John Hancock as well:
salaries[title_description == "TEACHER,DEMONSTRATION",
.N, by = home_organization_description]
## home_organization_description N
## 1: HANCOCK, JOHN SCHOOL 22
## 2: PENN TREATY HIGH SCHOOL 1
## 3: MASTERMAN,JULIA R. HIGH SCHOOL 54
## 4: HOWE, JULIA WARD SCHOOL 1
## 5: FRANKFORD HIGH SCHOOL 1
## 6: LABRUM,GEN HARRY MIDDLE SCHOOL 3
## 7: MASTBAUM, JULES E. HIGH SCHOOL 1
So, it turns out we’d be better off just using salaries for the merge to define the school’s name:
attendance[salaries,
school_name := unique(home_organization_description),
on = c(ULCS_NO = "home_organization"), by = .EACHI]
attendance[is.na(school_name)] #fixed!
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...
Now we can repeat our earlier exercise and reveal the names of the best- and worst-attended schools in Philly. While we’re at it, we can introduce column subsetting. To focus attention on only a few columns, we can simply name them in j, wrapped in .(). Only the columns we name will show up in the output, so it takes up less room and we can focus immediately on the most important information.
attendance[SCH_TEACHER_ATTEND < 90 |
SCH_TEACHER_ATTEND > 97
][order(-SCH_TEACHER_ATTEND),
.(ULCS_NO, school_name, SCH_TEACHER_ATTEND)]
## ULCS_NO school_name SCH_TEACHER_ATTEND
## 1: 1190 MOTIVATION HIGH SCHOOL 98.1
## 2: 6040 SAUL, WALTER B. HIGH SCHOOL 97.6
## 3: 8560 THE WORKSHOP SCHOOL 97.6
## 4: 5340 LUDLOW, JAMES R. SCHOOL 97.2
## 5: 7350 LOWELL, JAMES R. SCHOOL 89.9
## 6: 7510 BETHUNE, MARY MCLEOD SCHOOL 89.8
## 7: 8240 DISSTON, HAMILTON SCHOOL 89.7
## 8: 7290 STEARNE, ALLEN M. SCHOOL 89.7
## 9: 5430 AMY 5 AT JAMES MARTIN 89.6
## 10: 7220 CARNELL, LAURA H. SCHOOL 89.0
## 11: 6360 ROOSEVELT ELEMENTARY SCHOOL 89.0
## 12: 1330 HUEY, SAMUEL B. SCHOOL 88.8
## 13: 4240 CASSIDY,LEWIS C ACADEMICS PLUS 88.7
## 14: 1130 TILDEN MIDDLE SCHOOL 88.5
## 15: 1340 LEA, HENRY C. 88.4
## 16: 6260 HOUSTON, HENRY H. SCHOOL 88.0
## 17: 2480 ARTHUR, CHESTER A. SCHOOL 87.8
## 18: 6340 PENNELL, JOSEPH ELEMENTARY 86.9
## 19: 5370 MOFFET, JOHN SCHOOL 86.5
## 20: 4570 MEADE, GEN. GEORGE G. SCHOOL 86.4
## 21: 4350 RHODES ELEMENTARY SCHOOL 85.3
## ULCS_NO school_name SCH_TEACHER_ATTEND
How fitting! Motivation High School appears to value teacher attendance very highly.
Regressions
OK, enough data manipulation for now. Let’s get to some actual analysis, for cryin’ out loud!
We of course don’t have any experimental or quasi-experimental data here, so we’ll only be able to tease out correlations from our data. But we do have enough data to illustrate the basics of typical regression techniques, so here we go!
OLS
Most linear regressions in R are handled by the lm function (stands for linear model).
Let’s run a very simple regression – the most basic test of the gender pay gap. Like the other typical functions we’ve been using, we run lm in j of our data.table.
##no frills regression
teachers[ , lm(pay_rate ~ gender)]
##
## Call:
## lm(formula = pay_rate ~ gender)
##
## Coefficients:
## (Intercept) genderM
## 68624.9 -935.3
The basic output of lm tells us 1) the regression we ran and 2) the point estimates of the coefficients. This is nice, but I find it rather uninformative – crucially, the standard errors are missing. But fear not! We’ve only scratched the surface of what lm does.
#now we assign the output of lm so
# we can explore it in a bit more detail
gender_reg <- teachers[ , lm(pay_rate ~ gender)]
str(gender_reg)
## List of 13
## $ coefficients : Named num [1:2] 68625 -935
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "genderM"
## $ residuals : Named num [1:6652] 7836 -21931 -23266 -919 -19680 ...
## ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
## $ effects : Named num [1:6652] -5576090 -34041 -23198 -851 -19612 ...
## ..- attr(*, "names")= chr [1:6652] "(Intercept)" "genderM" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:6652] 68625 68625 68625 68625 68625 ...
## ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:6652, 1:2] -81.5598 0.0123 0.0123 0.0123 0.0123 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:6652] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "genderM"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## .. ..- attr(*, "contrasts")=List of 1
## .. .. ..$ gender: chr "contr.treatment"
## ..$ qraux: num [1:2] 1.01 1.01
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 6650
## $ contrasts :List of 1
## ..$ gender: chr "contr.treatment"
## $ xlevels :List of 1
## ..$ gender: chr [1:2] "F" "M"
## $ call : language lm(formula = pay_rate ~ gender)
## $ terms :Classes 'terms', 'formula' length 3 pay_rate ~ gender
## .. ..- attr(*, "variables")= language list(pay_rate, gender)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
## .. .. .. ..$ : chr "gender"
## .. ..- attr(*, "term.labels")= chr "gender"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x2e186e8>
## .. ..- attr(*, "predvars")= language list(pay_rate, gender)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
## .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
## $ model :'data.frame': 6652 obs. of 2 variables:
## ..$ pay_rate: int [1:6652] 76461 46694 45359 67706 48945 49615 65121 83382 75850 59532 ...
## ..$ gender : chr [1:6652] "F" "F" "F" "F" ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 pay_rate ~ gender
## .. .. ..- attr(*, "variables")= language list(pay_rate, gender)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
## .. .. .. .. ..$ : chr "gender"
## .. .. ..- attr(*, "term.labels")= chr "gender"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: 0x2e186e8>
## .. .. ..- attr(*, "predvars")= language list(pay_rate, gender)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
## .. .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
## - attr(*, "class")= chr "lm"
We can see from str that the output of lm is a list with a long set of components. That’s a bit of an information overload; we can home in on some important details by tinkering with our gender_reg object some more:
names(gender_reg)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
Our gender_reg object is actually pretty complicated! Just by running lm, we’re given back all sorts of essential results about the relationship between pay_rate and gender:
coefficients are of course the coefficients of the regression. We can access them with gender_reg$coefficients.
residuals are of course the residuals – the actual minus the predicted value for each observation.
effects are “the uncorrelated single-degree-of-freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR decomposition during the fitting process.” (I have no idea what this means)
rank should be equal to the number of coefficients (including the intercept), unless there is a multicollinearity problem.
fitted.values are the y_hat, the predicted pay_rate produced by the model for each observation.
assign helps map the matrix of regressors to the formula (I’ve never used this)
qr is the QR decomposition matrix associated with the regression (I’ve never used this)
df.residual is the residual degrees of freedom (= n - k), where n is the number of observations and k is the number of regressors (including the intercept)
contrasts tells which contrast codings were used (contrasts determine how the levels of discrete variables are turned into regressors)
xlevels gives the levels of the factors (discrete variables) that we included – for gender_reg, this is F and M, e.g.
call stores the regression we ran as a call to lm
terms gives a bunch of information about the variables we included in the regression – what they are, which are factors, which is the intercept, etc.
model gives the model frame – the response and predictor data actually used to fit the regression
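Rather than reaching into the list with $, base R also provides accessor functions for the most-used pieces. A quick sketch using the built-in mtcars data rather than our teachers table:

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)          # same as fit$coefficients
head(resid(fit))   # same as fit$residuals
head(fitted(fit))  # same as fit$fitted.values
fit$df.residual    # n - k = 32 - 2 = 30 for mtcars
```

The accessors are generally preferred because they work the same way across many model types, not just lm.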
It’s wonderful to know that R keeps track of all of this stuff for us, but what about the real meat of our regression – the stars! For this, we’ll use the summary function, which has a method for the output of lm that gives a nice, easily-digested summary of the model we ran:
summary(gender_reg)
##
## Call:
## lm(formula = pay_rate ~ gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68625 -10239 1517 8771 22361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68624.9 203.6 337.040 <2e-16 ***
## genderM -935.3 388.6 -2.407 0.0161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared: 0.0008702, Adjusted R-squared: 0.00072
## F-statistic: 5.792 on 1 and 6650 DF, p-value: 0.01613
This calls attention to some summary statistics about the residuals (the five-number summary, as well as the residual standard error) and about predictive power (multiple and adjusted R-squared, the F statistic), but most importantly, it prints the standard error of each coefficient and the associated two-sided t-test results.
Here, we can see that men are paid on average significantly less (at alpha = .05) than are women among Philadelphia teachers, by about $1,000.
We spent some time going over the details, so let’s summarize quickly with a comparison to Stata. Here’s the boiled-down version of getting a regression summary in R:
teachers[ , summary(lm(pay_rate ~ gender))]
##
## Call:
## lm(formula = pay_rate ~ gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68625 -10239 1517 8771 22361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68624.9 203.6 337.040 <2e-16 ***
## genderM -935.3 388.6 -2.407 0.0161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared: 0.0008702, Adjusted R-squared: 0.00072
## F-statistic: 5.792 on 1 and 6650 DF, p-value: 0.01613
In Stata, to do what we just did, one would simply run reg pay_rate gender. Hard to get simpler than that! My sales pitch in favor of R is only that it’s not that much more complicated, and as we saw above, R gives us at-hand access to a slew of summary statistics about the regression that are a bit less transparent to access in Stata.
Other covariates
It’s not a stretch to expand our regression to include other covariates to try and tame their influence. Perfect covariates here would be experience and certification, but sadly we’re not so lucky. But we do have a bunch of other school-level covariates we can include. Let’s try a few:
#skipping right to the summary
teachers[ , summary(lm(pay_rate ~ gender + school_pct_male + school_pct_black))]
##
## Call:
## lm(formula = pay_rate ~ gender + school_pct_male + school_pct_black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69929 -10025 1259 9614 24684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66136.6 1904.8 34.722 < 2e-16 ***
## genderM -513.0 397.8 -1.290 0.1972
## school_pct_male 8317.6 3626.9 2.293 0.0219 *
## school_pct_black -4014.2 588.1 -6.826 9.56e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14060 on 6312 degrees of freedom
## (336 observations deleted due to missingness)
## Multiple R-squared: 0.008304, Adjusted R-squared: 0.007833
## F-statistic: 17.62 on 3 and 6312 DF, p-value: 2.188e-11
Note that we lost a lot of observations because school_pct_male and school_pct_black are frequently missing. This was included in the summary output: (336 observations deleted due to missingness).
How do we interpret the results of this regression? The most naive possible conclusion is that mostly-black schools pay their teachers less – i.e., that there’s somehow a causal relationship between student makeup and the wages offered to teachers.
But we know that the teachers’ contract is fixed district-wide, so each school has no power whatsoever (well, not quite…) over what its teachers are paid. Instead, we’re witnessing the direct result of selection bias. We can get into this more later when we explore plotting.
Interactions
Stata handles interactions with the # operator (or ##, which also includes the main effects). R, by contrast, uses * for the main-effects-plus-interaction version and : for the interaction term alone – so R's * corresponds to Stata's ##. If we wanted to explore the interaction between gender and school_pct_white (i.e., to allow the slope of school_pct_white to differ by gender), we'd simply run:
#In Stata: reg pay_rate gender##c.school_pct_white
teachers[ , summary(lm(pay_rate ~ gender*school_pct_white))]
##
## Call:
## lm(formula = pay_rate ~ gender * school_pct_white)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72526 -9521 3577 7368 21002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68943.1 453.4 152.044 <2e-16 ***
## genderM 770.6 880.4 0.875 0.3814
## school_pct_white 4297.2 1515.3 2.836 0.0046 **
## genderM:school_pct_white -3852.0 3205.6 -1.202 0.2296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13740 on 3330 degrees of freedom
## (3318 observations deleted due to missingness)
## Multiple R-squared: 0.002439, Adjusted R-squared: 0.001541
## F-statistic: 2.714 on 3 and 3330 DF, p-value: 0.04332
The genderM:school_pct_white row is the interaction term – the difference in the slope of pay with respect to school whiteness between male and female teachers. Specifically, the expected pay change for a female teacher from a one-percentage-point increase in the share of white students at her school is about $43 (we divide the coefficient by 100 since school_pct_white is measured as a proportion, not a percentage); for a male teacher, it's about 43 - 39 = $4.
The interaction is not statistically significant, meaning we can't reject that the relationship between school whiteness and pay is the same for male and female teachers.
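To see exactly what * expands to, here's a self-contained toy sketch (the g, x, and y variables are hypothetical, not our teachers data). The fitted slope for the second factor level is the base slope plus the interaction coefficient:

```r
set.seed(1)
toy <- data.frame(g = rep(c("F", "M"), each = 50),
                  x = runif(100))
# true slopes: 2 for F, 2 + 3 = 5 for M
toy$y <- 1 + 2 * toy$x + 3 * (toy$g == "M") * toy$x + rnorm(100)

fit <- lm(y ~ g * x, data = toy)
names(fit$coefficients)
# "(Intercept)" "gM" "x" "gM:x" -- * expanded to g + x + g:x

slope_F <- fit$coefficients["x"]                          # slope for F
slope_M <- fit$coefficients["x"] + fit$coefficients["gM:x"] # slope for M
```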
Functions of variables
One of my big pet peeves in Stata is the inability to run something like reg y log(x) – we're forced to first create a logx variable with gen logx = log(x). Now we've got two objects in memory containing basically the same information – a waste of memory, in addition to being a pain to remember to do every time.
R does not suffer from this inanity. We can simply use functions within the formula specification:
#Let's try!
teachers[ , summary(lm(log(pay_rate) ~ gender))]
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'
#Whoops. Turns out we still have some teachers
# who are stored as having pay rate 0:
teachers[pay_rate == 0, .N]
## [1] 31
#So we need to exclude them because log(0) is -Inf:
teachers[pay_rate>0, summary(lm(log(pay_rate) ~ gender))]
##
## Call:
## lm(formula = log(pay_rate) ~ gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76603 -0.12744 0.04124 0.14091 0.30451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.121710 0.002995 3713.032 < 2e-16 ***
## genderM -0.018088 0.005708 -3.169 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2075 on 6619 degrees of freedom
## Multiple R-squared: 0.001515, Adjusted R-squared: 0.001364
## F-statistic: 10.04 on 1 and 6619 DF, p-value: 0.001538
It's a bit harder to interpret the coefficients in a regression with a logged outcome and discrete covariates, but the rule of thumb is that the coefficients are roughly proportional changes. So men are paid about 1.8% less on average in Philly.
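The percentage rule is only an approximation; the exact proportional effect of a dummy in a log-outcome regression is exp(beta) - 1. Plugging in the genderM estimate from the summary above:

```r
beta <- -0.018088      # genderM coefficient from the summary above
100 * (exp(beta) - 1)  # exact percent difference: about -1.79
# for small beta, exp(beta) - 1 is close to beta itself,
# which is where the rule of thumb comes from
```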
More functions of variables
What if we wanted to control for the combined percentage of black and hispanic students at the school, rather than for each separately?
In Stata, we'd have to do something like gen black_hisp = school_pct_black + school_pct_hispanic. And from what we've seen so far, lm(pay_rate ~ school_pct_black + school_pct_hispanic) will give us separate coefficients for the black and hispanic percentages.
To get variables like this (most commonly, involving simple arithmetic operations like + or *), there’s the special I function for regression formulas. Basically, we surround some operation done on our variables with I(), and whatever results from that will be considered as a single covariate. Watch:
teachers[ , summary(lm(pay_rate ~ gender +
I(school_pct_black + school_pct_hispanic)))]
##
## Call:
## lm(formula = pay_rate ~ gender + I(school_pct_black + school_pct_hispanic))
##
## Residuals:
## Min 1Q Median 3Q Max
## -71223 -9740 1986 9112 23446
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 73646.2 602.5 122.243
## genderM -278.9 458.3 -0.609
## I(school_pct_black + school_pct_hispanic) -6876.7 859.7 -7.999
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## genderM 0.543
## I(school_pct_black + school_pct_hispanic) 1.59e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13770 on 4509 degrees of freedom
## (2140 observations deleted due to missingness)
## Multiple R-squared: 0.01422, Adjusted R-squared: 0.01378
## F-statistic: 32.52 on 2 and 4509 DF, p-value: 9.465e-15
Predictions from a model
One of the most common things to do with a fitted regression is to use it for in- or out-of-sample predictions. For this, R has the predict function, which takes as its first argument the result of a model (e.g., lm), and as its second argument a new data.table.
Any prediction is model dependent; we’ll compare the predictions that come out of a few different relevant specifications.
#Model 1: Simplest linear specification
reg1 <-
teachers[ , lm(pay_rate ~ gender + school_pct_male +
school_pct_black + school_pct_disadv)]
#Model 2: Some interactions
reg2 <-
teachers[ , lm(pay_rate ~ gender*school_pct_disadv +
school_pct_male + school_pct_black)]
#Model 3: Full interactions
reg3 <-
teachers[ , lm(pay_rate ~ gender*school_pct_disadv*
school_pct_male*school_pct_black)]
Let’s try to predict the wages of a teacher who is male and at a school with 70% male, 40% black students, and 80% economically disadvantaged students with each model.
#declare new data to use for prediction
predict_table <- data.table(gender = "M", school_pct_male = .7,
school_pct_black = .4,
school_pct_disadv = .8)
#prediction from Model 1
predict(reg1, predict_table)
## 1
## 77028.89
#prediction from Model 2
predict(reg2, predict_table)
## 1
## 76675.7
#prediction from Model 3
predict(reg3, predict_table)
## 1
## 73401.6
Many predictions
This will be more useful when we get to plotting, but often in more complicated models we want to trace out the marginal effect of a variable on our prediction, holding other regressors constant – is the relationship monotonic? Increasing?
For this, we simply provide more than one row in the predict_table that we supply, e.g.:
#explore the relationship between % male students
# and teacher pay
predict_table <-
data.table(gender = "M",
school_pct_male =
#only use 10 points for now for
# simplicity since we're
# not plotting
seq(.5, 1, length.out = 10),
school_pct_black = .4,
school_pct_disadv = .8)
#Since all of these models are linear,
# we know for sure the relationship
# is monotonic, and we can predict
# the change from step to step because
# we know the coefficient on school_pct_male
predict(reg1, predict_table)
## 1 2 3 4 5 6 7 8
## 74441.11 75159.94 75878.77 76597.59 77316.42 78035.24 78754.07 79472.90
## 9 10
## 80191.72 80910.55
Prediction intervals
Point estimates are rarely the only thing we care about – as important are reasonable ranges in which our predictions might fall. We may predict wages of $75,000, but it would hardly call our model into question if we found actual wages of $75,001 for that person. Prediction intervals are a way of coupling our predictions with a range of estimates that are consistent with our model.
To this end, predict also has a few other arguments we can use to produce prediction intervals, e.g., interval and level, which tell the type of interval we want and the “alpha” to use, respectively:
predict(reg1, predict_table,
interval = "prediction", level = .95)
## fit lwr upr
## 1 74441.11 46800.92 102081.3
## 2 75159.94 47509.83 102810.0
## 3 75878.77 48212.76 103544.8
## 4 76597.59 48909.71 104285.5
## 5 77316.42 49600.70 105032.1
## 6 78035.24 50285.75 105784.7
## 7 78754.07 50964.87 106543.3
## 8 79472.90 51638.10 107307.7
## 9 80191.72 52305.46 108078.0
## 10 80910.55 52966.98 108854.1
That the intervals are so wide shows just how poor our chosen covariates are at predicting teachers’ wages.
Probit/Logit
The next most common form of regression analysis is logit/probit, for when we’re interested in understanding predictors of a binary dependent variable. I’ve never heard a compelling empirical reason for using one over the other; logistic is far more common, probably because it provides odds ratios, which are infinitely more interpretable than are the plain coefficients of either logit or probit models. But the difference in R between running logit and probit is trivial, so we’ll cover both here.
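A quick sketch of the odds-ratio point, using the built-in mtcars data (am = 1 for a manual transmission) rather than our teachers data:

```r
logit_fit <- glm(am ~ mpg, data = mtcars, family = binomial)

# raw logit coefficients are on the log-odds scale -- opaque:
logit_fit$coefficients

# exponentiating turns them into odds ratios: each extra mpg
# multiplies the odds of a manual transmission by this factor
exp(logit_fit$coefficients["mpg"])
```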
To avoid simply re-hashing what we did in the previous section and inducing a room more rife with glaze than Krispy Kreme, let’s focus on a new variable for our left-hand side.
Noticing that most schools are 100% economically disadvantaged, I wonder, are there certain characteristics which help predict that a teacher is at a school with less than full “poverty”?
We’ll define “less than full poverty” as school_pct_disadv < 100, which is true for teachers[school_pct_disadv < 100, .N] = 1666 teachers (and unique(teachers, by = "home_organization")[school_pct_disadv < 100, .N] = 52 schools).
The workhorse of binary-dependent-variable models in R is glm. This stands for generalized linear models. Generalized linear models subsume OLS and incorporate several regression models which can be related to OLS by a “link function”, including logit, probit and Poisson regression. See Wikipedia for more.
glm works basically like lm – its first argument is specified in exactly the same way (specifically, with an object of class formula), but we also have to include a second argument – family. This is basically specifying to glm the type of link function that you’ve got in mind for your regression. There is a multitude of options – Gamma, inverse.gaussian, poisson, quasipoisson, etc. (see ?family), but the most common and important is binomial.
Technically, what we pass to the family argument is a function – meaning, most importantly, that it itself accepts arguments! Specifically, the binomial function accepts one important argument, link. By default, this is "logit", meaning that the default for family = binomial is to run a logistic regression. There are two other options – link = "probit" (probit regression, i.e., family = binomial(link = "probit")) and link = "cauchit" (which uses a Cauchy CDF, if you know what that is).
Let’s see this in action and run our first logit and probits:
#We could define our own "poverty" variable,
# but instead we'll illustrate again another
# way to use the I function in formulas
logit <- teachers[ , glm(I(school_pct_disadv < 100) ~
pay_rate + gender + school_pct_male,
family = binomial)]
summary(logit)
##
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender +
## school_pct_male, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5236 -0.7773 -0.6549 0.5844 2.3622
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.072e+00 3.590e-01 8.556 < 2e-16 ***
## pay_rate 1.922e-05 2.240e-06 8.579 < 2e-16 ***
## genderM 3.649e-01 6.554e-02 5.568 2.58e-08 ***
## school_pct_male -1.080e+01 6.429e-01 -16.795 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7216.1 on 6363 degrees of freedom
## Residual deviance: 6793.8 on 6360 degrees of freedom
## (288 observations deleted due to missingness)
## AIC: 6801.8
##
## Number of Fisher Scoring iterations: 4
#run a probit instead by specifying the link
# argument in the binomial function
probit <- teachers[ , glm(I(school_pct_disadv < 100) ~
pay_rate + gender + school_pct_male,
family = binomial(link = "probit"))]
summary(probit)
##
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender +
## school_pct_male, family = binomial(link = "probit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4922 -0.7824 -0.6595 0.5984 2.4259
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.809e+00 2.082e-01 8.686 < 2e-16 ***
## pay_rate 1.104e-05 1.287e-06 8.579 < 2e-16 ***
## genderM 2.212e-01 3.866e-02 5.721 1.06e-08 ***
## school_pct_male -6.357e+00 3.705e-01 -17.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7216.1 on 6363 degrees of freedom
## Residual deviance: 6800.8 on 6360 degrees of freedom
## (288 observations deleted due to missingness)
## AIC: 6808.8
##
## Number of Fisher Scoring iterations: 4
We can’t compare coefficients between the models, except for their signs. The signs and significance both agree (as we would hope), and again seem to be telling us stuff we may have surmised – high-poverty schools have lower-wage teachers, have fewer male teachers, and have more male students.
To compare the models, we can look at the predicted probability of being at a low-poverty school for a typical teacher. We'll define a typical teacher as having the median pay rate; for the gender dummy, we'll plug in the overall share of male teachers; and we'll plug in the average percentage of male students.
I’m unaware of a way to tell R to use a percentage instead of 0/1 in predict, so we have to get our predictions by hand. Luckily this isn’t so bad:
#here is the "typical teacher"; we
# include 1 first to stand for the
# intercept term
typical <- teachers[!is.na(school_pct_male),
c(1, median(pay_rate), mean(gender == "M"),
mean(school_pct_male))]
#prediction from the logit model; recall that
# the interpretation of logit is that:
# P[y = 1 | X] = F(X'B)
# where X are our predictors, F is the link
# function CDF, and B is the coefficient vector
#In the case of logistic regression, F is the
# logistic CDF:
plogis(sum(logit$coefficients * typical))
## [1] 0.2446419
#In the case of probit, F is the normal CDF:
pnorm(sum(probit$coefficients * typical))
## [1] 0.2480797
#For comparison, the overall average in the sample is:
teachers[ , mean(school_pct_disadv < 100, na.rm = TRUE)]
## [1] 0.2598253
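As a sanity check on the by-hand approach: when every predictor is numeric (so we aren't plugging a fraction into a factor slot), predict with type = "response" applies the link CDF for us and should agree exactly. A self-contained toy check with the built-in mtcars data:

```r
logit_fit <- glm(am ~ mpg, data = mtcars, family = binomial)

# by hand: logistic CDF applied to X'B for mpg = 21
by_hand <- plogis(sum(logit_fit$coefficients * c(1, 21)))

# built in: predict does the same when asked for "response"
built_in <- predict(logit_fit, data.frame(mpg = 21),
                    type = "response")
all.equal(by_hand, unname(built_in))
```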
Reshaping
Let’s read the full demographic file on student gender again. This is just a copy-paste of what we did above.
school_gender <-
setDT(read_excel("./data/2015-2016 Enr Dem (Suppressed).xls",
sheet = "Gender", skip = 6,
col_names = c("", "school_id", "school_name", "grade",
"total_enrolled", "count_female",
"pct_female", "count_male", "pct_male")))
school_gender
## school_id school_name grade total_enrolled
## 1: 101 John Bartram High School 9.000000 200
## 2: 101 John Bartram High School 10.000000 197
## 3: 101 John Bartram High School 11.000000 172
## 4: 101 John Bartram High School 12.000000 141
## 5: 101 John Bartram High School ALL GRADES 710
## ---
## 1662: 878 Philadelphia Virtual Academy 9.000000 46
## 1663: 878 Philadelphia Virtual Academy 10.000000 69
## 1664: 878 Philadelphia Virtual Academy 11.000000 79
## 1665: 878 Philadelphia Virtual Academy 12.000000 64
## 1666: 878 Philadelphia Virtual Academy ALL GRADES 311
## count_female pct_female count_male pct_male
## 1: 70.000000 0.350000 130.000000 0.650000
## 2: 86.000000 0.436548 111.000000 0.563452
## 3: 78.000000 0.453488 94.000000 0.546512
## 4: 63.000000 0.446809 78.000000 0.553191
## 5: 297.000000 0.418310 413.000000 0.581690
## ---
## 1662: s s s s
## 1663: 43.000000 0.623188 26.000000 0.376812
## 1664: 47.000000 0.594937 32.000000 0.405063
## 1665: 41.000000 0.640625 23.000000 0.359375
## 1666: 191.000000 0.614148 120.000000 0.385852
#Note that grade is pretty poorly formatted.
# Let's fix that quickly. I'll explain what's
# going on here later when we go into
# regular expressions in a bit more depth.
# Basically, we're deleting the decimal and everything after it.
school_gender[grade != "ALL GRADES",
grade := paste0("GRADE_", gsub("\\..*", "", grade))][]
## school_id school_name grade total_enrolled
## 1: 101 John Bartram High School GRADE_9 200
## 2: 101 John Bartram High School GRADE_10 197
## 3: 101 John Bartram High School GRADE_11 172
## 4: 101 John Bartram High School GRADE_12 141
## 5: 101 John Bartram High School ALL GRADES 710
## ---
## 1662: 878 Philadelphia Virtual Academy GRADE_9 46
## 1663: 878 Philadelphia Virtual Academy GRADE_10 69
## 1664: 878 Philadelphia Virtual Academy GRADE_11 79
## 1665: 878 Philadelphia Virtual Academy GRADE_12 64
## 1666: 878 Philadelphia Virtual Academy ALL GRADES 311
## count_female pct_female count_male pct_male
## 1: 70.000000 0.350000 130.000000 0.650000
## 2: 86.000000 0.436548 111.000000 0.563452
## 3: 78.000000 0.453488 94.000000 0.546512
## 4: 63.000000 0.446809 78.000000 0.553191
## 5: 297.000000 0.418310 413.000000 0.581690
## ---
## 1662: s s s s
## 1663: 43.000000 0.623188 26.000000 0.376812
## 1664: 47.000000 0.594937 32.000000 0.405063
## 1665: 41.000000 0.640625 23.000000 0.359375
## 1666: 191.000000 0.614148 120.000000 0.385852
Above, we were focused only on school-level demographic data, so we threw away a lot of information.
In certain settings, we may be more focused on grade-level outcomes (if, for example, we knew the grade that each teacher primarily taught, then we might want to associate the grade-level demographics with each teacher).
All of this information is here in the file we’ve got, but it’s in a slightly unfortunate format – there are many rows associated with each school. It’s much easier to merge a data.table where each school is found in one, and only one, row.
Long to Wide: dcast
To accomplish this in Stata we would use reshape wide, like reshape wide stub, i(i) j(j).
In R, we use the dcast function (I believe this means data cast, with the idea that we’re casting data “out and to the side” by going from long to wide). There are three main arguments to dcast: data (the data.table to be reshaped), formula (a specification of how to reshape, akin to the i(i) and j(j) in Stata), and value.var (the equivalent of stub in Stata).
It can be confusing to wrap your head around which arguments go where; here is an illustration to help orient your thinking about a given problem:

Basically, the variables that we want to end up identifying rows go on the left of the formula (Left-Hand Side), and those whose values we want spread out as columns go on the right. value.var names the variable(s) whose values fill in each resulting LHS/RHS cell.
Let’s reshape the gender demographics data so that there is only one row associated with each school:
school_gender_wide <-
dcast(school_gender, school_id ~ grade, value.var = "pct_male")
school_gender_wide
## school_id ALL GRADES GRADE_0 GRADE_1 GRADE_10 GRADE_11 GRADE_12
## 1: 101 0.581690 NA NA 0.563452 0.546512 0.553191
## 2: 102 0.550186 NA NA 0.468531 0.549180 0.537736
## 3: 103 0.530888 NA NA 0.547826 0.480315 0.512195
## 4: 105 0.425170 NA NA 0.485714 0.426667 0.324324
## 5: 110 0.547945 NA NA 0.517241 0.537190 0.500000
## ---
## 214: 842 0.535922 0.603604 0.457944 NA NA NA
## 215: 843 0.501355 0.515625 0.515152 NA NA NA
## 216: 844 0.529412 0.515873 0.551724 NA NA NA
## 217: 856 0.618357 NA NA s 0.543860 s
## 218: 878 0.385852 NA NA 0.376812 0.405063 0.359375
## GRADE_2 GRADE_3 GRADE_4 GRADE_5 GRADE_6 GRADE_7 GRADE_8
## 1: NA NA NA NA NA NA NA
## 2: NA NA NA NA NA NA NA
## 3: NA NA NA NA NA NA NA
## 4: NA NA NA NA NA NA NA
## 5: NA NA NA NA NA NA NA
## ---
## 214: 0.470149 0.549550 0.500000 0.568000 0.576577 0.459459 0.656863
## 215: 0.556962 0.529412 0.481481 0.486842 0.468354 0.478723 0.481481
## 216: 0.552448 0.537879 0.467213 0.545455 NA NA NA
## 217: NA NA NA NA NA NA NA
## 218: NA NA NA NA s s s
## GRADE_9
## 1: 0.650000
## 2: 0.628743
## 3: 0.575163
## 4: 0.466667
## 5: 0.607595
## ---
## 214: NA
## 215: NA
## 216: NA
## 217: 0.555556
## 218: s
We now have all of the grades as columns in the output. Note how much data is missing – this shows that most schools only cover a small range of grades. The grades are also out of order – they’ve been alphabetized, and "GRADE 10" comes before "GRADE 2" alphabetically. We can fix this by using factors; we may get into that later.
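As a preview of the factor fix, here's a self-contained toy (hypothetical id, grade, and pct columns): if we declare grade as a factor with the levels in the order we want before casting, dcast should emit the columns in that order instead of alphabetically.

```r
library(data.table)
toy <- data.table(id = rep(1:2, each = 3),
                  grade = rep(c("GRADE_2", "GRADE_9", "GRADE_10"), 2),
                  pct = c(.50, .60, .40, .55, .45, .50))

# character grade: columns come out alphabetized,
# so GRADE_10 lands before GRADE_2
names(dcast(toy, id ~ grade, value.var = "pct"))

# factor grade with explicit levels: columns follow level order
toy[ , grade := factor(grade, levels = paste0("GRADE_", c(2, 9, 10)))]
names(dcast(toy, id ~ grade, value.var = "pct"))
```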
Note that we've lost a lot of information that was in our data originally – the data on student counts, the name of the school, etc. We can keep this sort of thing by using dcast more flexibly:
#keep school name by adding it as an ID
# variable; since school_name is the same
# for every school_id, this will not
# change our output.
dcast(school_gender, school_id + school_name ~ grade, value.var = "pct_male")
## school_id school_name ALL GRADES
## 1: 101 John Bartram High School 0.581690
## 2: 102 West Philadelphia High School 0.550186
## 3: 103 High School of the Future 0.530888
## 4: 105 Paul Robeson High School for Human Services 0.425170
## 5: 110 William L. Sayre High School 0.547945
## ---
## 214: 842 Stephen Decatur School 0.535922
## 215: 843 Joseph Greenberg School 0.501355
## 216: 844 William H. Loesche School 0.529412
## 217: 856 The Workshop School 0.618357
## 218: 878 Philadelphia Virtual Academy 0.385852
## GRADE_0 GRADE_1 GRADE_10 GRADE_11 GRADE_12 GRADE_2 GRADE_3
## 1: NA NA 0.563452 0.546512 0.553191 NA NA
## 2: NA NA 0.468531 0.549180 0.537736 NA NA
## 3: NA NA 0.547826 0.480315 0.512195 NA NA
## 4: NA NA 0.485714 0.426667 0.324324 NA NA
## 5: NA NA 0.517241 0.537190 0.500000 NA NA
## ---
## 214: 0.603604 0.457944 NA NA NA 0.470149 0.549550
## 215: 0.515625 0.515152 NA NA NA 0.556962 0.529412
## 216: 0.515873 0.551724 NA NA NA 0.552448 0.537879
## 217: NA NA s 0.543860 s NA NA
## 218: NA NA 0.376812 0.405063 0.359375 NA NA
## GRADE_4 GRADE_5 GRADE_6 GRADE_7 GRADE_8 GRADE_9
## 1: NA NA NA NA NA 0.650000
## 2: NA NA NA NA NA 0.628743
## 3: NA NA NA NA NA 0.575163
## 4: NA NA NA NA NA 0.466667
## 5: NA NA NA NA NA 0.607595
## ---
## 214: 0.500000 0.568000 0.576577 0.459459 0.656863 NA
## 215: 0.481481 0.486842 0.468354 0.478723 0.481481 NA
## 216: 0.467213 0.545455 NA NA NA NA
## 217: NA NA NA NA NA 0.555556
## 218: NA NA s s s s
#keep the count of male students; for brevity,
# we'll also exclude a bunch of grades so the
# output is not as space-consuming
dcast(school_gender[grade %in% paste0("GRADE_", 1:6)],
school_id ~ grade, value.var = c("pct_male", "count_male"))
## school_id pct_male_GRADE_1 pct_male_GRADE_2 pct_male_GRADE_3
## 1: 113 NA NA NA
## 2: 120 0.450820 0.470000 0.489362
## 3: 123 0.491525 0.571429 0.402985
## 4: 125 0.534483 0.478723 0.509091
## 5: 126 0.439560 0.523256 0.530303
## ---
## 164: 841 0.486486 0.554622 0.438776
## 165: 842 0.457944 0.470149 0.549550
## 166: 843 0.515152 0.556962 0.529412
## 167: 844 0.551724 0.552448 0.537879
## 168: 878 NA NA NA
## pct_male_GRADE_4 pct_male_GRADE_5 pct_male_GRADE_6 count_male_GRADE_1
## 1: NA s 0.548148 NA
## 2: 0.425000 0.482353 0.564516 55.000000
## 3: s 0.615385 0.551020 29.000000
## 4: 0.536082 0.515789 NA 62.000000
## 5: 0.633803 0.588235 0.557692 40.000000
## ---
## 164: 0.560000 0.535714 0.552083 54.000000
## 165: 0.500000 0.568000 0.576577 49.000000
## 166: 0.481481 0.486842 0.468354 51.000000
## 167: 0.467213 0.545455 NA 64.000000
## 168: NA NA s NA
## count_male_GRADE_2 count_male_GRADE_3 count_male_GRADE_4
## 1: NA NA NA
## 2: 47.000000 46.000000 34.000000
## 3: 36.000000 27.000000 s
## 4: 45.000000 56.000000 52.000000
## 5: 45.000000 35.000000 45.000000
## ---
## 164: 66.000000 43.000000 56.000000
## 165: 63.000000 61.000000 59.000000
## 166: 44.000000 45.000000 39.000000
## 167: 79.000000 71.000000 57.000000
## 168: NA NA NA
## count_male_GRADE_5 count_male_GRADE_6
## 1: s 74.000000
## 2: 41.000000 35.000000
## 3: 32.000000 27.000000
## 4: 49.000000 NA
## 5: 30.000000 29.000000
## ---
## 164: 60.000000 53.000000
## 165: 71.000000 64.000000
## 166: 37.000000 37.000000
## 167: 78.000000 NA
## 168: NA s
Wide to Long: melt
As often as we're given data in long form that we prefer in wide form, we're presented with data in wide form that we'd prefer in long form. Stata uses similar terminology for both – reshape wide for the former and reshape long for the latter – but R uses different words: dcast for the former and melt for the latter. I picture the data set made of wax, melting away into a longer form.
melt also takes three main arguments – data (same as before), id.vars (same as LHS above), and measure.vars (basically RHS above). measure.vars can be a bit tricky, and there are a few ways to use it. First we can specify the column names explicitly. Second, we can specify a pattern matched by the names of the columns we’d like. Last, we can use column numbers (this isn’t recommended, so I won’t demonstrate it). This gets slightly more complicated when we want to split into more than one new column.
Let’s try this with our school_gender data. We’ll stack the genders on top of each other and create a single column for student counts, then another for percentages (basically doubling the number of rows in the table).
#First, using explicit column names
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = c("count_male", "count_female"))
## school_id grade variable value
## 1: 101 GRADE_9 count_male 130.000000
## 2: 101 GRADE_10 count_male 111.000000
## 3: 101 GRADE_11 count_male 94.000000
## 4: 101 GRADE_12 count_male 78.000000
## 5: 101 ALL GRADES count_male 413.000000
## ---
## 3328: 878 GRADE_9 count_female s
## 3329: 878 GRADE_10 count_female 43.000000
## 3330: 878 GRADE_11 count_female 47.000000
## 3331: 878 GRADE_12 count_female 41.000000
## 3332: 878 ALL GRADES count_female 191.000000
#It's usually more concise to use the patterns
# convenience "function"; the syntax for this
# generally involves regular expressions, so
# I'll hold off on demonstrating the full
# power of this approach for now.
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = patterns("count"))
## school_id grade variable value
## 1: 101 GRADE_9 count_female 70.000000
## 2: 101 GRADE_10 count_female 86.000000
## 3: 101 GRADE_11 count_female 78.000000
## 4: 101 GRADE_12 count_female 63.000000
## 5: 101 ALL GRADES count_female 297.000000
## ---
## 3328: 878 GRADE_9 count_male s
## 3329: 878 GRADE_10 count_male 26.000000
## 3330: 878 GRADE_11 count_male 32.000000
## 3331: 878 GRADE_12 count_male 23.000000
## 3332: 878 ALL GRADES count_male 120.000000
#A bit more careful about the output to make
# it a bit more understandable
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = patterns("count"),
variable.name = "gender",
value.name = "count")
## school_id grade gender count
## 1: 101 GRADE_9 count_female 70.000000
## 2: 101 GRADE_10 count_female 86.000000
## 3: 101 GRADE_11 count_female 78.000000
## 4: 101 GRADE_12 count_female 63.000000
## 5: 101 ALL GRADES count_female 297.000000
## ---
## 3328: 878 GRADE_9 count_male s
## 3329: 878 GRADE_10 count_male 26.000000
## 3330: 878 GRADE_11 count_male 32.000000
## 3331: 878 GRADE_12 count_male 23.000000
## 3332: 878 ALL GRADES count_male 120.000000
We can extend this approach to handle reshaping both count and pct like so:
#Admittedly a bit tough to tell whether variable = 1
# corresponds to Male or Female
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = patterns("count", "pct"))
## school_id grade variable value1 value2
## 1: 101 GRADE_9 1 70.000000 0.350000
## 2: 101 GRADE_10 1 86.000000 0.436548
## 3: 101 GRADE_11 1 78.000000 0.453488
## 4: 101 GRADE_12 1 63.000000 0.446809
## 5: 101 ALL GRADES 1 297.000000 0.418310
## ---
## 3328: 878 GRADE_9 2 s s
## 3329: 878 GRADE_10 2 26.000000 0.376812
## 3330: 878 GRADE_11 2 32.000000 0.405063
## 3331: 878 GRADE_12 2 23.000000 0.359375
## 3332: 878 ALL GRADES 2 120.000000 0.385852
#If we want to use the explicitly-name approach
# of before, we have to keep all associated columns
# as one item of a list -- here, the first
# item of the list is for counts, and the second
# item of the list is for percentages.
#Advantage: we know for sure that count_male
# and pct_male correspond to variable = 1
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = list(c("count_male", "count_female"),
c("pct_male", "pct_female")))
## school_id grade variable value1 value2
## 1: 101 GRADE_9 1 130.000000 0.650000
## 2: 101 GRADE_10 1 111.000000 0.563452
## 3: 101 GRADE_11 1 94.000000 0.546512
## 4: 101 GRADE_12 1 78.000000 0.553191
## 5: 101 ALL GRADES 1 413.000000 0.581690
## ---
## 3328: 878 GRADE_9 2 s s
## 3329: 878 GRADE_10 2 43.000000 0.623188
## 3330: 878 GRADE_11 2 47.000000 0.594937
## 3331: 878 GRADE_12 2 41.000000 0.640625
## 3332: 878 ALL GRADES 2 191.000000 0.614148
#We can again specify value.name to facilitate understanding
melt(school_gender,
id.vars = c("school_id", "grade"),
measure.vars = list(c("count_male", "count_female"),
c("pct_male", "pct_female")),
value.name = c("count", "pct"))
## school_id grade variable count pct
## 1: 101 GRADE_9 1 130.000000 0.650000
## 2: 101 GRADE_10 1 111.000000 0.563452
## 3: 101 GRADE_11 1 94.000000 0.546512
## 4: 101 GRADE_12 1 78.000000 0.553191
## 5: 101 ALL GRADES 1 413.000000 0.581690
## ---
## 3328: 878 GRADE_9 2 s s
## 3329: 878 GRADE_10 2 43.000000 0.623188
## 3330: 878 GRADE_11 2 47.000000 0.594937
## 3331: 878 GRADE_12 2 41.000000 0.640625
## 3332: 878 ALL GRADES 2 191.000000 0.614148
String Data / regex
String data is ubiquitous. And it is ubiquitously messy. Typos, transpositions, inconsistent abbreviations, strange character encodings, arbitrary punctuation… there’s no end to the nightmares awaiting a data analyst forced to work with string data.
The most powerful tool that a data analyst can have in their belt to attack such data is a facility with regular expressions.
Basically, regular expressions are a language for describing and manipulating patterns in string data. Have you ever wondered how many words in English fit a certain pattern? Regular expressions are your best friend.
What are all the words in the English language that have the five vowels in order? I used the Regex Dictionary to search for the pattern a.*e.*i.*o.*u.* (we’ll cover what this means momentarily) and got the following list:
abstemious, adventitious, amentiferous, anemophilous, arenicolous, argentiferous, arsenious, arteriovenous, autoecious, cavernicolous, facetious, garnetiferous, sacrilegious, theater-in-the-round
Just like that, instantly! Regular expressions are basically a programming language in and of themselves. They can be accessed from most other programs – Stata, Excel, SAS, MATLAB, Python, you name it. Here we’ll obviously focus on their use in R.
To introduce the most common regex constructs, I’ll run through a bunch of quick examples demonstrating each type of pattern on a single string; then at the end we’ll quickly apply this to some of our school data.
gsub, grep, grepl, strsplit
These four are the workhorse functions for using regular expressions in R. They each serve a distinct purpose, and I’ll try to use all four a couple of times each in this brief overview.
gsub is used for substitution of patterns. Want to delete all numbers? Cut certain strings from names? This is your friend. A quick side note on etymology – the “g” in gsub stands for global. There is a rarely-used alternative to gsub, sub, which will replace only the first match of the supplied pattern… I’ve only ever used it once, just something to keep in mind.
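Since the gsub/sub distinction is easy to gloss over, here’s a minimal sketch (the string is invented for illustration):

```r
x <- "room 101, building 101"

#gsub replaces EVERY match of the pattern...
gsub("101", "202", x)
## [1] "room 202, building 202"

#...while sub stops after the first match
sub("101", "202", x)
## [1] "room 202, building 101"
```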
grep (an obscure acronym for global regular expression print, though I only ever say “grep”) has two purposes. The first is to identify which elements of a string vector match a pattern, returning the indices of the matches. The second (when the argument value = TRUE is set) is to return the matched strings themselves.
grepl (the “l” stands for logical) simply tests for the existence of the pattern in each string: if the pattern is matched, we get TRUE, otherwise FALSE.
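That leaves strsplit, which splits each string at every match of a pattern and returns a list (one element per input string). A quick sketch on made-up names:

```r
nms <- c("Bob Dylan", "Johnny Cash")

#split on whitespace; the result is a list, since
# each string can produce a different number of pieces
strsplit(nms, split = "\\s")
## [[1]]
## [1] "Bob"   "Dylan"
##
## [[2]]
## [1] "Johnny" "Cash"

#a common idiom: grab the first piece of each
sapply(strsplit(nms, split = "\\s"), `[`, 1L)
## [1] "Bob"    "Johnny"
```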
Simple, Exact Substring Matching
The most common regular expression is simply an exact sequence of letters. Suppose we wanted to find all the people named Bob. Then we might do:
strings <- c("Bob Dylan", "Johnny Cash", "Bob Marley", "Janis Joplin")
#note the order of arguments!!
# this is a pet peeve of mine, as I'd expect
# strings to go first.
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan" "Bob Marley"
firsts <- c("Bob", "Johnny", "Bob", "Janis")
lasts <- c("Dylan", "Cash", "Marley", "Joplin")
#We might use grep without value = TRUE to
# find the last names of all Bobs:
grep("Bob", strings)
## [1] 1 3
lasts[grep("Bob", strings)]
## [1] "Dylan" "Marley"
mons <- c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December")
#Which months can we eat oysters?
grep("r", mons)
## [1] 1 2 3 4 9 10 11 12
Beginning/End Anchors: ^ and $
Sometimes we want to be sure we’re matching a pattern from the beginning or end of a string. Regex uses two symbols to represent these: ^ (beginning) and $ (end).
Let’s try it out:
strings <- c("Bob Dylan", "Johnny Cash",
"Bob Marley", "Janis Joplin",
"Billy-Bob Thornton")
#Just matching Bob will give us the last
# string as well:
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan" "Bob Marley" "Billy-Bob Thornton"
#To exclude this, we can force Bob
# to have come from the beginning
# of the string:
grep("^Bob", strings, value = TRUE)
## [1] "Bob Dylan" "Bob Marley"
states <-
c("Alabama", "Alaska", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut",
"Delaware", "Florida", "Georgia", "Hawaii",
"Idaho", "Illinois", "Indiana", "Iowa",
"Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan",
"Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire",
"New Jersey", "New Mexico", "New York",
"North Carolina", "North Dakota", "Ohio",
"Oklahoma", "Oregon", "Pennsylvania", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington",
"West Virginia", "Wisconsin", "Wyoming")
#Which states start with New?
grep("^New", states, value = TRUE)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
#Which states DON'T end with a?
states[!grepl("a$", states)]
## [1] "Arkansas" "Colorado" "Connecticut" "Delaware"
## [5] "Hawaii" "Idaho" "Illinois" "Kansas"
## [9] "Kentucky" "Maine" "Maryland" "Massachusetts"
## [13] "Michigan" "Mississippi" "Missouri" "New Hampshire"
## [17] "New Jersey" "New Mexico" "New York" "Ohio"
## [21] "Oregon" "Rhode Island" "Tennessee" "Texas"
## [25] "Utah" "Vermont" "Washington" "Wisconsin"
## [29] "Wyoming"
#We could also have used another
# argument of grep, invert,
# to accomplish the same thing as
# x[!grepl("pattern", x)]
grep("a$", states, value = TRUE, invert = TRUE)
## [1] "Arkansas" "Colorado" "Connecticut" "Delaware"
## [5] "Hawaii" "Idaho" "Illinois" "Kansas"
## [9] "Kentucky" "Maine" "Maryland" "Massachusetts"
## [13] "Michigan" "Mississippi" "Missouri" "New Hampshire"
## [17] "New Jersey" "New Mexico" "New York" "Ohio"
## [21] "Oregon" "Rhode Island" "Tennessee" "Texas"
## [25] "Utah" "Vermont" "Washington" "Wisconsin"
## [29] "Wyoming"
Multiple Possibilities: |
As is typical in programming, we can use | to mean “or”; | separates alternative patterns. Let’s try it out:
#Let's get states starting with directional names
grep("North|South|East|West", states, value = TRUE)
## [1] "North Carolina" "North Dakota" "South Carolina" "South Dakota"
## [5] "West Virginia"
White Space: \\s
White space comes in many forms. The most common, obviously, is a simple space, but the category also includes other blank characters that may show up unexpectedly in our data, like tabs (\t), newlines (\n), carriage returns (\r), form feeds (\f), etc. (like I said, string data is messy).
In regex in R, \\s matches all of these (those who have seen regex before should note that we need two backslashes because the backslash itself needs to be escaped in an R string). For those interested in web scraping, unfortunately, it appears that \\s does not match the non-breaking space; we’ll return to this later.
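To see \\s catching more than plain spaces, here’s a quick sketch on some invented messy strings:

```r
messy <- c("tab\there", "new\nline", "plain space", "nospace")

grepl("\\s", messy)
## [1]  TRUE  TRUE  TRUE FALSE

#collapse any run of whitespace to a single space
gsub("\\s+", " ", messy)
## [1] "tab here"    "new line"    "plain space" "nospace"
```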
#Which states have a space of any sort?
grep("\\s", states, value = TRUE)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [5] "North Carolina" "North Dakota" "Rhode Island" "South Carolina"
## [9] "South Dakota" "West Virginia"
Wildcard: .
Sometimes, we don’t care what character is in a certain place, just that it’s there. This type of pattern is called a wildcard, and it’s denoted in regex by a period (.).
Actually, it’s rare to use it by itself, but it’ll come up a lot once we have more tools. The only use I can think of for this alone is cheating at crosswords. Suppose we wanted to fill in a crossword answer with the pattern L _ V _ _ _ _ _ _. As long as we know it’s one word, we can go to the Regex Dictionary and search the pattern l.v...... and we’ll get the full list of possible answers: lava-lava, lavaliere, leviathan, Levitical, Leviticus, liverleaf, liverwort, liveryman, livestock. We’ll use this more soon.
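We can mimic that dictionary search in R; the word list below is just a handful of hand-picked candidates, not a real dictionary:

```r
words <- c("leviathan", "livestock", "love", "lavender", "lavaliere")

#'l', any character, 'v', then exactly six more characters;
# the anchors force the whole word to be nine letters long
grep("^l.v......$", words, value = TRUE)
## [1] "leviathan" "livestock" "lavaliere"
```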
Quantifiers: *, +, ?, {m(, n)}
The most common appearance of . from the last section is as .*. We saw this when I found all the pan-vowel words in the introduction to this section.
* is a quantifier – it quantifies the piece of the pattern directly before it. In particular, * says to match the thing before it 0 or more times, so .* means “match anything any number of times (including none).” The regex from the introduction, a.*e.*i.*o.*u, thus means: match “‘a’, followed by anything any number of times, then ‘e’, then anything any number of times, then ‘i’, then anything any number of times, then ‘o’, then anything any number of times, then ‘u’”.
Here are all the possible quantifiers:
*: Match at least 0 times
+: Match at least 1 time
?: Match 0 or 1 times.
{m}: Match exactly m times.
{m,n}: Match between m and n times, inclusive. If n is blank, it’s at least m times.
Some demonstration:
#What states have a doubled s?
grep("s{2}", states, value = TRUE)
## [1] "Massachusetts" "Mississippi" "Missouri" "Tennessee"
#Which states have an a in them
# between the first and the last letter?
# (first letter is excluded because
# we're only looking for lowercase a)
grep("a.+", states, value = TRUE)
## [1] "Alabama" "Alaska" "Arkansas" "California"
## [5] "Colorado" "Delaware" "Hawaii" "Idaho"
## [9] "Indiana" "Kansas" "Louisiana" "Maine"
## [13] "Maryland" "Massachusetts" "Michigan" "Montana"
## [17] "Nebraska" "Nevada" "New Hampshire" "North Carolina"
## [21] "North Dakota" "Oklahoma" "Pennsylvania" "Rhode Island"
## [25] "South Carolina" "South Dakota" "Texas" "Utah"
## [29] "Washington"
flips <-
c("HTHHHTHHHTH",
"THTTHTTHHTH",
"THTHTHHTHHHT",
"THTHTTTHTTHTTT")
#Which sequences of coin flips had
# at least 3 heads in a row?
grepl("H{3,}", flips)
## [1] TRUE FALSE TRUE FALSE
#Let's match the middle initial K, possibly
# followed by some punctuation
grepl("\\sK.?", c("Johnson, Mary K",
"Hobbes, Calvin K.",
"Martin, George R. R."))
## [1] TRUE TRUE FALSE
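One quantifier we haven’t exercised is {m,n} with both bounds set; a quick made-up sketch:

```r
#match two to three 'o's in a row
grepl("o{2,3}", c("book", "cool", "oops", "solo"))
## [1]  TRUE  TRUE  TRUE FALSE
```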
Group Matching: []
[] is used for a less inclusive version of the wildcard .: we specify the characters we’re OK with inside [], and any of them will be matched. Most commonly, this is used for matching any letter or any number. To do this, we use - inside [] to mean “through”, like [A-Z] (A through Z):
#Let's match ANY middle initial
nms <- c("Johnson, Mary K",
"Hobbes, Calvin K.",
"Martin, George R. R.", "Obama, Barack",
"Prince, The Artist Formerly Known As")
grepl("\\s[A-Z]\\.?$", nms)
## [1] TRUE TRUE TRUE FALSE FALSE
#If we want to find anyone with a period,
# we have to be careful. This won't work:
grepl(".", nms)
## [1] TRUE TRUE TRUE TRUE TRUE
#We CAN put a period inside a group:
grepl("[.]", nms)
## [1] FALSE TRUE TRUE FALSE FALSE
#OR we can escape it with \\:
grepl("\\.", nms)
## [1] FALSE TRUE TRUE FALSE FALSE
#Let's remove all digits from some addresses
gsub("[0-9]", "", c("3718 Locust Walk",
"219 S 41st Street",
"1600 Pennsylvania Ave NW"))
## [1] " Locust Walk" " S st Street" " Pennsylvania Ave NW"
#Whoops, we took away too much there!
# We also missed the leading white space
# Let's try again:
gsub("^[0-9]*\\s*", "", c("3718 Locust Walk",
"219 S 41st Street",
"1600 Pennsylvania Ave NW"))
## [1] "Locust Walk" "S 41st Street" "Pennsylvania Ave NW"
Specials in Grouping: [:punct:], [:alnum:]
Also within [] we can refer to certain sets of characters with special shorthands. The two most useful are [:punct:], which matches punctuation (specifically, any of ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~), and [:alnum:], which matches any alphanumeric character (i.e., [:alnum:] is the same as [A-Za-z0-9]). (There are also [:upper:], [:lower:], [:digit:], and [:alpha:], but all of these are longer to write out than A-Z, a-z, 0-9, and A-Za-z, respectively.)
#get rid of all punctuation
gsub("[[:punct:]]", "", "John C. Willington, Jr.")
## [1] "John C Willington Jr"
#get rid of all alphanumeric characters
gsub("[[:alnum:]]", "", "John C. Willington, Jr.")
## [1] " . , ."
Negative Group Matching: [^]
We can specify a group of characters not to match by starting the matching group with [^ instead of [.
#Keep only digits
gsub("[^0-9]", "", c("3718 Locust Walk",
"219 S 41st Street",
"1600 Pennsylvania Ave NW"))
## [1] "3718" "21941" "1600"
A Quick Data Exercise
Remember how we found out that Masterman had a strange job description for its teachers? We were susceptible to this because we relied on finding the most common job description to identify teachers; that approach was bound to fail when a small, specialized school like Masterman used some alternative description on a small scale.
A better approach to tackling this problem would have been to use regex. After some brief inspection of our salaries data, we could have seen that TEACHER appears consistently in the job descriptions of teachers. We can start from there and winnow our options down to only the teachers we’re interested in, while being more inclusive than just TEACHER,FULL TIME:
#putting grepl in the first argument means
# we'll only get rows for which there is
# a match to the pattern
salaries[grepl("TEACHER", title_description),
.N, by = title_description][order(-N)]
## title_description N
## 1: TEACHER,FULL TIME 6652
## 2: TEACHER,SPEC EDUCATION 1314
## 3: TEACHER-EXTRA CURR/STAFF DEVEL 336
## 4: TEACHER ASST,PKHS 109
## 5: TEACHER,DEMONSTRATION 83
## 6: TEACHER,HEAD,PKHS/BRIGHT 38
## 7: RETIRED TEACHER, PER DIEM SUB 19
## 8: CONSULTING TEACHER 19
## 9: TEACHER,SPEC ASSIGN,12 MO 16
## 10: TEACHER,DEMONSTRATION,SPEC ED 16
## 11: RETIRED TEACHER,PER DIEM SP ED 16
## 12: TEACHER,LONG TERM SUBSTITUTE 6
## 13: TEACHER,NON-CERT HRLY 4
## 14: TEACHER, PER DIEM SUBSTITUTE 1
## 15: MGR TEACHER COACHES 1
On a smaller scale like this, it’s a lot easier to pick through every possible TEACHER-related job title. We can see Masterman’s TEACHER,DEMONSTRATION right there as the 5th-most-common. Suppose we had wanted to keep only TEACHER,FULL TIME and TEACHER,DEMONSTRATION for our teachers data set. We might have subsetted like this:
#(FULL|DEMO) matches ONLY
# TEACHER,FULL or TEACHER,DEMO;
# the rest of the regex is to exclude
# TEACHER,DEMONSTRATION,SPEC ED, which
# is distinguished by having a comma
# later in the string, so we require
# our match to have all the remaining
# characters after (FULL|DEMO) to
# NOT be a comma.
salaries[grepl("TEACHER,(FULL|DEMO)[^,]*$", title_description),
unique(title_description)]
## [1] "TEACHER,FULL TIME" "TEACHER,DEMONSTRATION"
What if we only wanted to compare the average wage of PFT employees vs. non-PFT employees? The PFT represents several arms/classes of district employees, so we might not want to look solely at PFT-TEACHER. We can do this with regex:
#make sure we don't mix hourly employees
# since their wages are quoted by the hour
salaries[pay_rate_type == "SALARIED",
.(avg_wage = mean(pay_rate)),
by = .(pft = grepl("^PFT", type_of_representation))]
## pft avg_wage
## 1: FALSE 35481.39
## 2: TRUE 58571.78
Lastly, we can extract only employees working at schools (as opposed to offices) with something like:
salaries[grepl("SCHOOL", organization_level),
.N, organization_level][order(-N)]
## organization_level N
## 1: ELEMENTARY SCHOOL 9667
## 2: HIGH SCHOOL 3101
## 3: MIDDLE SCHOOL 954
## 4: TRANSITION / OVERAGE SCHOOL 141
## 5: CHARTER SCHOOLS 12
## 6: NON PUBLIC SCHOOL 1